Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1. #440
base: sv/integration-target--combinable-DFA-capture-resolution
Conversation
Remove include/adt/mappingset.h; it's no longer used.
This and src/libfsm/internal.h's EXPENSIVE_CHECKS should move to a common place later.
These symbols are exported in the library.
These rely on either direct FSM construction or setting captures via a capture path, but captures are now implemented by a completely different mechanism. New tests will be added with the capture code in a future commit.
This is a big commit, unfortunately difficult to break apart further due to interface changes, metadata being passed through whole-FSM transformations, and so on. Sorry about that.

- Delete code related to capture action metadata on edges. That approach made FSM transformations (determinisation, minimisation, etc.) considerably more expensive, and there were some corner cases that I wasn't able to get working correctly.
- Switch to a somewhat simpler method, adapted from Russ Cox's "Regular Expression Matching: the Virtual Machine Approach". Since the capture resolution metadata (an opcode program for a virtual machine) is associated with individual end states, this combines cleanly when multiple regexes are unioned into a single large DFA that matches them all at once (see the sketch after this list).
- Add lots of capture regression tests, mostly from using libfsm's `fsm_generate_matches` and a fuzzer to compare behavior against PCRE. This brought many, many obscure cases to light. The updated fuzzer harness will appear in a later commit.
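For readers unfamiliar with the VM approach, here is a rough sketch of the kind of metadata involved. This is illustrative only -- the opcode set and struct names are my shorthand for the general technique, not the interfaces added in this PR:

```c
#include <stdint.h>
#include <stddef.h>

/* Each end state carries the ID of a capture-resolution program, so when
 * several regexes' DFAs are unioned, a surviving end state still refers
 * to the program(s) for the pattern(s) it accepts. */
enum capvm_op {
	CAPVM_OP_CHAR,		/* match one input byte */
	CAPVM_OP_SAVE,		/* record the current input offset in a capture slot */
	CAPVM_OP_SPLIT,		/* branch: greedy and non-greedy continuations */
	CAPVM_OP_JMP,		/* unconditional jump */
	CAPVM_OP_MATCH		/* accept */
};

struct capvm_instr {
	enum capvm_op op;
	uint32_t a, b;		/* operands: byte, slot, or branch targets */
};

struct capvm_program {
	size_t count;
	struct capvm_instr *instrs;
};
```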
Add a couple more execution modes to the fuzzer test harness. In particular, add support for comparing match and capture behavior against PCRE's -- because this depends on linking with libpcre, it's disabled by default. Set `PCRE_CMP` in the Makefile to enable it, or pass it as a build argument.
I'm adding this because many of the existing tests under `tests/pcre-anchor/` and so on contain regexes that would now be rejected as unsupported in combination with captures, but they are testing cases unrelated to capturing.
There are several tests that have nothing to do with captures; capture behavior is tested directly by `tests/captures/`.
We currently get the wrong capture result for it.
There were modifications to metadata associated with end states on both sides. A few other changes will appear in separate commits.

Conflicts:
- include/fsm/fsm.h
- src/libfsm/consolidate.c
- src/libfsm/endids.c
- src/libfsm/merge.c
Instead of exposing exactly how many captures the FSM has, we now track a ceiling for the count, which determines how large the capture buffer needs to be. The exact count could be restored if `fsm_capture_count` gets re-added.
Previously we doubled the buffer if it wasn't large enough, but the next endid array may be more than twice the size of the old buffer, so we need to keep expanding until it's large enough. We also weren't saving the updated array size, which could lead to repeated doubling and eventually allocation failures. Also, change the assert on `fsm_getendids`'s result: if we get to that point, the IDs should always be found, not merely not hit an insufficient-space error.
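A minimal sketch of the corrected growth loop, assuming `fsm_end_id_t` is an unsigned integer type (names here are illustrative, not the actual code in src/libfsm/endids.c):

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint32_t fsm_end_id_t;	/* stand-in for libfsm's end ID type */

static int
grow_endid_buf(fsm_end_id_t **buf, size_t *ceil, size_t needed)
{
	size_t nceil = (*ceil == 0) ? 1 : *ceil;
	while (nceil < needed) {
		nceil *= 2;	/* a single doubling may not be enough */
	}
	if (nceil != *ceil) {
		fsm_end_id_t *nbuf = realloc(*buf, nceil * sizeof **buf);
		if (nbuf == NULL) {
			return 0;	/* allocation failure */
		}
		*buf = nbuf;
		*ceil = nceil;	/* save the updated size -- the missing step */
	}
	return 1;
}
```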
determinise: It's not possible to find a cached result in the hash table without allocating a to-set buffer first, so assert that it will be non-NULL.

fsm_findmode: This should never be used on a state without edges.

vm/v1.c and vm/v2.c: Free allocated return value on error.
For each pattern 0..n that will be combined, set an endid on it. Then generate inputs that match, and check that the endid results on the single and combined FSMs are consistent.
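The shape of the test, with hypothetical helpers standing in for the real scaffolding (`union_all` is assumed to copy its inputs, so the individual FSMs remain usable):

```c
#include <assert.h>
#include <stddef.h>

struct fsm;
typedef unsigned fsm_end_id_t;

/* hypothetical helpers, not libfsm's actual API */
extern int fsm_set_single_endid(struct fsm *fsm, fsm_end_id_t id);
extern struct fsm *union_all(struct fsm **fsms, size_t n);
extern int run_and_get_endid(const struct fsm *fsm, const char *input,
    fsm_end_id_t *id);

static void
check_endid_consistency(struct fsm **fsms, size_t n,
    const char *input_for_pattern[])
{
	/* tag every end state of pattern i with endid i */
	for (size_t i = 0; i < n; i++) {
		assert(fsm_set_single_endid(fsms[i], (fsm_end_id_t)i));
	}
	struct fsm *combined = union_all(fsms, n);

	/* an input matching pattern i must report endid i on both FSMs */
	for (size_t i = 0; i < n; i++) {
		fsm_end_id_t single, multi;
		assert(run_and_get_endid(fsms[i], input_for_pattern[i], &single));
		assert(run_and_get_endid(combined, input_for_pattern[i], &multi));
		assert(single == i && multi == i);
	}
}
```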
In minimise.c, `split_ecs_by_end_metadata` does a first pass splitting ECs based on their end data (as the name says). This determines which end metadata should prevent states from starting out in the same EC -- effectively, which states can or cannot be combined once minimisation's analysis is done.

Previously, distinct sets of end IDs would keep states from merging, but if epsilon removal and determinisation have led to end states with distinct sets of end IDs, that alone shouldn't prevent them from merging later -- the merged end state just becomes associated with all of those end IDs. We do prevent states with distinct capture ID sets from merging, but that's because of a few special cases like "^a(b)c$" and "^a(b*)c$", where combining partially overlapping regexes' end states could lead to false positives in the capture result.

Note: I added checking the program IDs (which was missing) to `same_end_metadata`. This seems correct to me, but at the moment I can't think of any test inputs that would lead to the same sets of capture IDs but distinct sets of program IDs. I will see if fuzzing can find any.

This is tested by tests/endids/endids2_union_many_endids.c and the new multi test in tests/capture/capture_test_case_list.c.
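A sketch of the comparison described above, with hypothetical types (the real minimise.c structures differ):

```c
#include <stdbool.h>

struct id_set;		/* opaque sorted ID set, hypothetical */
extern bool id_set_equal(const struct id_set *a, const struct id_set *b);

struct end_md {
	const struct id_set *end_ids;		/* NOT compared */
	const struct id_set *capture_ids;
	const struct id_set *program_ids;
};

/* End IDs are deliberately not compared: states whose end ID sets differ
 * may still merge, and the merged state becomes associated with the union
 * of those IDs. Distinct capture ID sets or capture program ID sets do
 * keep states in separate ECs. */
static bool
same_end_metadata(const struct end_md *a, const struct end_md *b)
{
	return id_set_equal(a->capture_ids, b->capture_ids)
	    && id_set_equal(a->program_ids, b->program_ids);
}
```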
There were several unused-store warnings about values that were only set for logging; either `(void)x` them out or restructure the code so that their scope is more limited. Remove `next_offset` from `populate_solution`, since it isn't used at all. IIRC it was added before some of the path-sharing details made it unnecessary.
Also compare in the other direction: generate matching inputs from the combined DFA, then check that each individual regex's captures still match as expected.
This was set up to handle edge cases like `^(($)|x)+$`, where an alt with some nullable cases containing anchors is surrounded by + repetition, which turned up during fuzzing. This is an obscure edge case; it was proving very difficult to handle correctly, and there is probably little value in actually doing so. Now we flag it as an unsupported PCRE construct in ast_analysis.c's analysis_iter_repetition, so the code using repeated_alt_backpatches is unreachable. Just remove it.
Strictly speaking this shouldn't include nodes that have been flagged with `AST_FLAG_UNSATISFIABLE`. I have re-fuzzed with this changed and it does not seem to have introduced any new issues. `active_node` is only used in one place.
Rather than commenting throughout to note that .cont is the greedy branch and .new is non-greedy, just rename them.
Addressed some merge conflicts in fuzz/target.c.
The description says "Return where an item would be, if it were inserted", but it was returning the last element <= rather than the first element >=, and the later call to `state_set_cmpval` was shifting i by 1 to compensate for that specific case. Handle it correctly inside the search function instead. Two other call sites need to check whether the result refers to the append position (one past the end of the array) before checking `set->a[i] == state`; update them. Also add a fast path upfront: it's VERY common to append states in order to the state array, so before binary searching, first compare against the last entry (unless the set is empty).
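A minimal sketch of the corrected contract plus the append fast path, simplified to a bare sorted array of state IDs (illustrative, not the actual stateset.c code):

```c
#include <stddef.h>

/* Return the index where `state` would be inserted: the first element
 * >= state, possibly one past the end. Callers checking membership must
 * first test that the result isn't the append position. */
static size_t
find_insert_position(const unsigned *a, size_t count, unsigned state)
{
	/* fast path: appending in order is very common */
	if (count == 0 || a[count - 1] < state) {
		return count;	/* append position: one past the end */
	}

	size_t lo = 0, hi = count;
	while (lo < hi) {	/* invariant: a[hi..count-1] >= state */
		const size_t mid = lo + (hi - lo) / 2;
		if (a[mid] < state) {
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return lo;		/* first element >= state */
}
```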
At -O0 this can become pretty expensive (~25% of overall runtime for `time ./re -rpcre -C '^[ab]{0,2000}$'`), but when built with -O3 very little overhead remains. I'm adding this comment because every time I see this it seems like it should have `EXPENSIVE_CHECKS` around it, but profiling tells me it really doesn't matter.
This is a major hotspot when doing epsilon removal over large runs of potentially skipped states (as might appear from `^[ab]{0,2000}$`). Add a fast path for appending, which is also very common. Extract the edge set destination search into its own function, `find_state_position`, and add a `#define` to switch between linear search, binary search, or calling both and comparing the results. I will remove linear search in the next commit, but am committing this as an intermediate step for validation and benchmarking.
When I run `time ./re -rpcre -C '^[ab]{0,2000}$'` locally at -O3:
- linear search: 2.991s
- binary search: 1.521s
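The switchable-search pattern this describes, roughly (hypothetical names; the real `find_state_position` takes the edge set's own types):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned fsm_state_t;	/* stand-in for libfsm's state type */
struct edge_group;		/* opaque here */

extern size_t find_linear(const struct edge_group *g, size_t count,
    fsm_state_t state);
extern size_t find_binary(const struct edge_group *g, size_t count,
    fsm_state_t state);

#define FIND_LINEAR 0
#define FIND_BINARY 1
#define FIND_BOTH   2		/* run both and compare, for validation */

#define FIND_MODE FIND_BOTH

static size_t
find_state_position(const struct edge_group *g, size_t count,
    fsm_state_t state)
{
#if FIND_MODE == FIND_LINEAR
	return find_linear(g, count, state);
#elif FIND_MODE == FIND_BINARY
	return find_binary(g, count, state);
#else
	const size_t l = find_linear(g, count, state);
	const size_t b = find_binary(g, count, state);
	assert(l == b);		/* both implementations must agree */
	return b;
#endif
}
```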
After the other changes in this PR, calls to qsort from `sort_and_dedup_dst_buf` are one of the largest remaining hotspots in the profile. We can often avoid calling qsort, though:
- If there is <= 1 entry, just return; it's already sorted.
- Otherwise, first do a sweep through the array noting the min and max values. Unless there is a huge range between them, it's much faster to build a bitset in a small (max 10KB) stack-allocated array and then unpack the bitset (now sorted and unique). Only the needed portion of the array is initialized.

I have not done a lot of experimentation to find the cutoff point where the bitset becomes slower than qsort (it may be much larger); I picked 10KB because it's likely to be safe to stack-allocate. I tried changing the bitset unpacking to use an 8- or 16-bit mask to jump forward faster through large sub-word runs of 0 bits, but any improvement was lost among random variation, so I decided it wasn't worth the extra complexity. We already skip whole words that are 0.
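A minimal sketch of the bitset fast path, assuming the caller has already verified that max - min fits in the stack bitset; `__builtin_ctzll` is a GCC/Clang builtin, and all names are illustrative:

```c
#include <stdint.h>
#include <string.h>

#define BITSET_BITS (10 * 1024 * 8)	/* ~10KB of stack */

/* Sort and dedup buf[0..count-1] in place without qsort: set a bit per
 * value, then unpack the bits in ascending order. Returns the new count. */
static size_t
sort_and_dedup_small_range(uint32_t *buf, size_t count,
    uint32_t min, uint32_t max)
{
	uint64_t bits[BITSET_BITS / 64];
	const size_t words = (max - min) / 64 + 1;
	memset(bits, 0x00, words * sizeof bits[0]);	/* only the used part */

	for (size_t i = 0; i < count; i++) {
		const uint32_t v = buf[i] - min;
		bits[v / 64] |= (uint64_t)1 << (v & 63);
	}

	size_t used = 0;
	for (size_t w = 0; w < words; w++) {
		if (bits[w] == 0) { continue; }	/* skip all-zero words */
		for (uint64_t rem = bits[w]; rem != 0; rem &= rem - 1) {
			const uint32_t bit = (uint32_t)__builtin_ctzll(rem);
			buf[used++] = min + (uint32_t)(w * 64 + bit);
		}
	}
	return used;
}
```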
If min and max are exactly 64 states apart, the upper value was silently dropped due to an incorrect `words` value here. One of the patterns in the PCRE suite triggers this:

./re -rpcre '(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))' "caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"

This should match, but did not.
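The boundary arithmetic, as a tiny helper: min..max is an inclusive range of (max - min + 1) values, so a distance of exactly 64 reaches bit 0 of a second 64-bit word, which truncating division would drop (illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

static size_t
words_for_range(uint32_t min, uint32_t max)
{
	/* e.g. max - min == 64 needs 2 words, not 1 */
	return ((size_t)(max - min) / 64) + 1;
}
```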
There were merge conflicts, which had to do with the return type of idmap_iter's callback -- on `main` it was `void`, on the branch it was `int`, indicating whether iteration should continue or halt. I picked the `int` form and updated `idmap_iter` so that the result is actually checked, because it probably doesn't make sense to continue iterating when earlier steps have errored out.
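Roughly the shape of the merged iterator, with a hypothetical bucket layout (the real idmap differs):

```c
#include <stddef.h>

typedef unsigned fsm_state_t;	/* stand-in for libfsm's state type */

struct idmap_bucket {
	int in_use;
	fsm_state_t state_id;
	unsigned value;
};

struct idmap {
	size_t bucket_count;
	struct idmap_bucket *buckets;
};

/* The callback returns nonzero to continue, zero to halt. */
typedef int idmap_iter_fun(fsm_state_t state_id, unsigned value,
    void *opaque);

static int
idmap_iter(const struct idmap *m, idmap_iter_fun *cb, void *opaque)
{
	for (size_t i = 0; i < m->bucket_count; i++) {
		const struct idmap_bucket *b = &m->buckets[i];
		if (!b->in_use) {
			continue;
		}
		if (!cb(b->state_id, b->value, opaque)) {
			return 0;	/* halt: an earlier step errored out */
		}
	}
	return 1;
}
```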
I've merged the current
```c
	MODE_REGEX,
	MODE_REGEX_SINGLE_ONLY,
	MODE_REGEX_MULTI_ONLY,
	MODE_IDEMPOTENT_DET_MIN,
```
Can you add these new modes to ci.yml too please? Currently we have:

```yaml
mode: [ m, p, d ]
```
Updated in 239927c.
```c
static int
compare_with_pcre(const char *pattern, struct fsm *fsm)
{
```
fantastic
```c
/* This triggers an "unreached" assertion in the parser.
 * It's already been reported (issue #386), but once the
 * fuzzer finds it, it will report it over and over.
 * Exit here so that the fuzzer considers it uninteresting. */
```
- 'd' (default) no longer exists; it's now 'r'
- add 's', 'i', 'M'
There are extra error codes in #440 for regexes that aren't UNSATISFIABLE per se, but depend on particular corner cases in PCRE that probably aren't worth supporting in an automata-based implementation. Add a test case for one, tests/pcre/in48.re:

^a|$[^x]b*

This is a tricky one to handle properly; according to PCRE it should match either "a<anything...>" OR "\n", but nothing else. The newline match is because $ is a non-input-consuming check that evaluation is either at the end of input, or at a newline immediately before the end. In this case `$[^x]b*` matches exactly one newline; it's equivalent to `$\n`. This probably isn't worth supporting, but we can detect cases where a potential newline match appears after a $ and reject them as an unsupported PCRE behavior.
This adds an extra compilation step for PCRE regexes, generating programs for a group-capture resolution abstract machine. The implementation builds on the approach described in "Regular Expression Matching: the Virtual Machine Approach", with adaptations to reduce the runtime memory footprint, handle some edge cases the same way PCRE does, explicitly detect some other PCRE edge cases and reject them as unsupported, and continue supporting group captures as the resulting DFAs are combined with other DFAs into a single DFA that matches them all in one pass.
I have fuzzed this code quite a lot, both comparing group capture results to PCRE's and comparing libfsm's handling of individual regexes' DFAs with the DFA produced by combining them. Many of the test cases in tests/capture/capture_test_case_list.c came from fuzzing. My working branch diverged from `main` for a while during this work, in particular around the metadata associated with end states, but after integrating upstream changes I fuzzed it further. I think I integrated things properly, but it's something to be mindful of during review. Some unrelated issues discovered during fuzzing have already been posted as separate PRs.

Worst-case memory usage is proportional to the capvm program's length (known at compile time) and the input length -- we record a bit for each branch taken while evaluating the regex (e.g. either greedily jumping to repeat a subexpression or non-greedily advancing in the regex), so memory usage slowly increases as input is evaluated. Common path prefixes are shared between threads, and the total thread count (and divergence) is bounded by the size of the opcode program. As a special case, a long path prefix of all zero bits is collapsed down to a counter; this usually happens because of an unanchored start loop, where each zero bit represents a path continuing to advance through the input without starting the match yet.
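A speculative sketch of what such a path encoding can look like; this is my reading of the description above, with invented names, not capvm's actual data structures:

```c
#include <stdint.h>
#include <stddef.h>

struct capvm_path {
	uint64_t zero_prefix;			/* run of leading 0 bits, collapsed */
	const struct capvm_path *prefix;	/* shared with other threads, or NULL */
	uint64_t bits;				/* branch decisions after the prefix */
	uint8_t used;				/* how many bits of `bits` are valid */
};

/* Record one branch decision: 1 = greedily repeat the subexpression,
 * 0 = non-greedily advance. While a path is still nothing but zeros
 * (typically the unanchored start loop), only bump a counter. */
static void
path_append(struct capvm_path *p, int greedy_branch)
{
	if (p->prefix == NULL && p->used == 0 && !greedy_branch) {
		p->zero_prefix++;
		return;
	}
	p->bits = (p->bits << 1) | (greedy_branch ? 1 : 0);
	p->used++;	/* a real implementation spills to a shared node at 64 */
}
```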
This PR targets an integration branch because some work is not yet complete:
- It does not yet handle some anchoring edge cases, such as `^(()($)|x)+$`. These are probably not worth the trouble to support. It also doesn't support the `\z` operator yet.
- Some capture-related features in `re` aren't enabled yet, such as capture resolution for multiple files, and the docs need updating for some new flags.