-
-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1. #440
base: sv/integration-target--combinable-DFA-capture-resolution
Are you sure you want to change the base?
Commits on May 1, 2023
-
Configuration menu - View commit details
-
Copy full SHA for db702ff - Browse repository at this point
Copy the full SHA db702ffView commit details -
Switch adt code to
uint64_t
for hashes, notunsigned long
.Remove include/adt/mappingset.h, it's no longer used.
Configuration menu - View commit details
-
Copy full SHA for 41a92d7 - Browse repository at this point
Copy the full SHA 41a92d7View commit details
Commits on May 2, 2023
-
stateset: Add
EXPENSIVE_CHECKS
guard around expensive asserts.This and src/libfsm/internal.h's EXPENSIVE_CHECKS should move to a common place later.
Configuration menu - View commit details
-
Copy full SHA for ec30958 - Browse repository at this point
Copy the full SHA ec30958View commit details -
Namespace epsilon_closure and closure_free with "fsm_".
These symbols are exported in the library.
Configuration menu - View commit details
-
Copy full SHA for d1ce686 - Browse repository at this point
Copy the full SHA d1ce686View commit details
Commits on May 3, 2023
-
tests/capture/: Delete outdated capture tests.
These rely on either direct FSM construction or setting captures via a capture path, but captures are now implemented by a completely different mechanism. New tests will be added with the capture code in a future commit.
Configuration menu - View commit details
-
Copy full SHA for deb14a8 - Browse repository at this point
Copy the full SHA deb14a8View commit details
Commits on Jun 2, 2023
-
Configuration menu - View commit details
-
Copy full SHA for ffea22e - Browse repository at this point
Copy the full SHA ffea22eView commit details -
Complemely rework capture resolution.
This is a big commit, unfortunately difficult to break apart further due to interface changes, metadata being passed through whole-FSM transformations, and so on. Sorry about that. - Delete code related to capture action metadata on edges. That approach made FSM transformations (determinisation, minimisation, etc.) considerably more expensive, and there were some corner cases that I wasn't able to get working correctly. - Switch to a somewhat simpler method, adapted from Russ Cox's "Regular Expression Matching: the Virtual Machine Approach". Since the capture resolution metadata (an opcode program for a virtual machine) is associated with individual end states, this combines cleanly when multiple regexes are unioned into a single large DFA that matches them all at once. - Add lots of capture regression tests, mostly from using libfsm's `fsm_generate_matches` and a fuzzer to compare behavior against PCRE. This brought many, many obscure cases to light. The updated fuzzer harness will appear in a later commit.
Configuration menu - View commit details
-
Copy full SHA for 2beeff4 - Browse repository at this point
Copy the full SHA 2beeff4View commit details -
Configuration menu - View commit details
-
Copy full SHA for 31ff234 - Browse repository at this point
Copy the full SHA 31ff234View commit details -
fuzz/target.c: Add multi-regex and single regex cmp against PCRE.
Add a couple more execution modes to the fuzzer test harness. In particular, add support for comparing match and capture behavior against PCRE's -- because this depends on linking with libpcre, it's disabled by defalut. Set `PCRE_CMP` in the Makefile to enable it, or pass it as a build argument.
Configuration menu - View commit details
-
Copy full SHA for 77f9260 - Browse repository at this point
Copy the full SHA 77f9260View commit details -
re: Add -FC flag to disable captures.
I'm adding this because many of the existing tests under `tests/pcre-anchor/` and so on contain regexes that would now be rejected as unsupported in combination with captures, but they are testing cases unrelated to capturing.
Configuration menu - View commit details
-
Copy full SHA for 318c9ad - Browse repository at this point
Copy the full SHA 318c9adView commit details -
tests/*/Makefile: Add
-FC
(no captures) for some calls to RE.There are several tests that have nothing to do with captures, capture behavior is tested directly with `tests/captures/`.
Configuration menu - View commit details
-
Copy full SHA for 5e49713 - Browse repository at this point
Copy the full SHA 5e49713View commit details -
ast_analysis: Reject '\z' as unsupported.
We currently get the wrong capture result for it.
Configuration menu - View commit details
-
Copy full SHA for 6e410b3 - Browse repository at this point
Copy the full SHA 6e410b3View commit details
Commits on Jun 15, 2023
-
Merge branch 'main' into sv/integrate-combinable-DFA-capture-resolution
There were modifications to metadata associated with end states on both sides. A few other changes will appear in separate commits. Conflicts: include/fsm/fsm.h src/libfsm/consolidate.c src/libfsm/endids.c src/libfsm/merge.c
Configuration menu - View commit details
-
Copy full SHA for 5ddb4f0 - Browse repository at this point
Copy the full SHA 5ddb4f0View commit details -
union.c: Remove EXPENSIVE_CHECKS based on removed interface.
Now instead of exposing exactly how many captures the fsm has, we keep track of the ceiling of the count, to track how large the capture buffer needs to be. We could add this back if fsm_capture_count gets re-added.
Configuration menu - View commit details
-
Copy full SHA for 86ec9e4 - Browse repository at this point
Copy the full SHA 86ec9e4View commit details -
bugfix: resize endid buffer for carry_end_metadata properly.
Previously we doubled the buffer if it wasn't large enough, but the next endid array may be > twice the size of the old buffer, so we need to keep expanding until it's large enough. We weren't saving the updated array size, so this could potentially lead to repeated doubling and eventually allocation failures. Also, change the assert for fsm_getendids's result -- if we get to that point, it should always be found, not just not an insufficient space error.
Configuration menu - View commit details
-
Copy full SHA for d54ba0d - Browse repository at this point
Copy the full SHA d54ba0dView commit details -
Address a couple warnings from scan-build.
determinise: It's not possible to find a cached result in the hash table without allocating a to-set buffer first, so assert that it will be non-NULL. fsm_findmode: This should never be used on a state without edges. vm/v1.c and vm/v2.c: Free allocated return value on error.
Configuration menu - View commit details
-
Copy full SHA for e2ac4d7 - Browse repository at this point
Copy the full SHA e2ac4d7View commit details -
fuzz/target.c: In MULTI mode, check endid behavior.
For each pattern 0..n that will be combined, set an endid on them. Then, generate inputs that match, and check that the endid result on the single and combined FSMs are consistent.
Configuration menu - View commit details
-
Copy full SHA for c11f372 - Browse repository at this point
Copy the full SHA c11f372View commit details
Commits on Jul 10, 2023
-
minimisation: Distinct sets of end IDs should not split ECs.
In minimise.c, `split_ecs_by_end_metadata` does a first pass splitting ECs based on their end data (like the name says). This sets which end metadata should prevent states from starting out in the same EC, effectively which states can/cannot be combined once minimisation's analysis is done. Previously, distinct sets of end IDs would keep states from merging, but if epsilon removal and determinisation have led to end states with distinct sets of end IDs, that alone shouldn't prevent them from merging later -- the same end state just becomes associated with all those end IDs. We do prevent states with distinct capture ID sets from merging, but that's because of a few special cases like "^a(b)c$" and "^a(b*)c$", where combining partially overlapping regexes' end states could lead to false positives in the capture result. Note: I added checking the program IDs (which was missing) to `same_end_metadata`. This seems correct to me, but at the moment I can't think of any test inputs that would lead to the same sets of capture IDs but distinct sets of program IDs. I will see if fuzzing can find any. This is tested by tests/endids/endids2_union_many_endids.c and the new multi test in tests/capture/capture_test_case_list.c.
Configuration menu - View commit details
-
Copy full SHA for 8b49a1a - Browse repository at this point
Copy the full SHA 8b49a1aView commit details -
Configuration menu - View commit details
-
Copy full SHA for a3a79ca - Browse repository at this point
Copy the full SHA a3a79caView commit details
Commits on Jul 11, 2023
-
capture_vm_exec: Address scan-build warnings.
There were several unused store warnings about values that were only set for logging, either `(void)x` them out or restructure the code so that their scope is more limited. Remove `next_offset` from `populate_solution`, since that isn't being used at all. IIRC it was added before some of the path sharing details made it unnecessary.
Configuration menu - View commit details
-
Copy full SHA for bfc9a80 - Browse repository at this point
Copy the full SHA bfc9a80View commit details
Commits on Jul 17, 2023
-
fuzz/target.c: Expand build_and_check_multi.
Also compare in the other direction, generating matching inputs from the combined DFA and then check that individual regexes's captures still match as expected.
Configuration menu - View commit details
-
Copy full SHA for f43585a - Browse repository at this point
Copy the full SHA f43585aView commit details -
Configuration menu - View commit details
-
Copy full SHA for 3a93f6d - Browse repository at this point
Copy the full SHA 3a93f6dView commit details -
Remove nullable ALT backpatching, it's now unreachable.
This was set up to handle edge cases like `^(($)|x)+$` where there is an alt with some nullable cases with anchors, surrounded by + repetition, which turned up during fuzzing. This is an obscure edge case, it is proving very difficult to handle correctly, and there is probably little value in actually doing so. Now we are flagging it as an unsupported PCRE construct in ast_analysis.c's analysis_iter_repetition, so the code using repeated_alt_backpatches is unreachable. Just remove it.
Configuration menu - View commit details
-
Copy full SHA for da51d9e - Browse repository at this point
Copy the full SHA da51d9eView commit details -
Configuration menu - View commit details
-
Copy full SHA for 9f3ae29 - Browse repository at this point
Copy the full SHA 9f3ae29View commit details
Commits on Jul 18, 2023
-
re_capvm_compile: Change "active_node" categorization.
Strictly speaking this shouldn't include nodes that have been flagged with `AST_FLAG_UNSATISFIABLE`. I have re-fuzzed with this changed and it does not seem to have introduced any new issues. `active_node` is only used in one place.
Configuration menu - View commit details
-
Copy full SHA for 26483cf - Browse repository at this point
Copy the full SHA 26483cfView commit details
Commits on Jul 21, 2023
-
capture_vm: rename split's .cont and .new to .greedy and .nongreedy.
Rather than commenting throughout to note that .cont is the greedy branch and .new is non-greedy, just rename them.
Configuration menu - View commit details
-
Copy full SHA for e9353af - Browse repository at this point
Copy the full SHA e9353afView commit details -
Configuration menu - View commit details
-
Copy full SHA for 1e55220 - Browse repository at this point
Copy the full SHA 1e55220View commit details
Commits on Jul 24, 2023
-
Merge branch 'main' into sv/integrate-combinable-DFA-capture-resolution
Addressed some merge conflicts to fuzz/target.c.
Configuration menu - View commit details
-
Copy full SHA for 36dfddf - Browse repository at this point
Copy the full SHA 36dfddfView commit details -
Configuration menu - View commit details
-
Copy full SHA for 937c9f0 - Browse repository at this point
Copy the full SHA 937c9f0View commit details -
Configuration menu - View commit details
-
Copy full SHA for 5e54a8b - Browse repository at this point
Copy the full SHA 5e54a8bView commit details
Commits on Jul 25, 2023
-
Configuration menu - View commit details
-
Copy full SHA for 94d6e7f - Browse repository at this point
Copy the full SHA 94d6e7fView commit details -
re_capvm_compile.c: Avoid potential sign-extension bug.
`expr->u.literal.c` is signed, so when negative casting it to `uint64_t` can lead to a 64-bit value > 255. Clamp it to 8 bits.
Configuration menu - View commit details
-
Copy full SHA for dced7c7 - Browse repository at this point
Copy the full SHA dced7c7View commit details -
Do non-zero allocations to silence EFENCE.
EFENCE breaks CI builds that (correctly) do a zero-size allocation when calling `calloc` with a count of zero, reporting that it's a likely bug. Use a size of 1 instead of 0 to silence it.
Configuration menu - View commit details
-
Copy full SHA for 8f0a683 - Browse repository at this point
Copy the full SHA 8f0a683View commit details -
re_capvm_compile: subtree_represents_character_class: handle empty b.
The comment says this handles a .b that is EMPTY, but the recursive call rejects it. Something in PCRE suite in CI is triggering this.
Configuration menu - View commit details
-
Copy full SHA for 722260a - Browse repository at this point
Copy the full SHA 722260aView commit details
Commits on Sep 11, 2023
-
state_set: Improve
state_set_search
performance, correct result.The description says "Return where an item would be, if it were inserted", but it was returning the last element <= rather than the first element >=, then the call to `state_set_cmpval` later was shifting i by 1 for that specific case. Handle it correctly inside the search function instead. Two other all call sites need to check whether the result refers to the append position (one past the end of the array) before checking `set->a[i] == state`, update them. Add a fast path upfront: It's VERY common to append states in order to the state array, so before we binary search each first compare against the last entry (unless empty).
Configuration menu - View commit details
-
Copy full SHA for 75bb88d - Browse repository at this point
Copy the full SHA 75bb88dView commit details -
Configuration menu - View commit details
-
Copy full SHA for 709b8cc - Browse repository at this point
Copy the full SHA 709b8ccView commit details -
stateset: Add note about potentially expensive assertion.
In -O0 this can become pretty expensive (~25% of overall runtime for `time ./re -rpcre -C '^[ab]{0,2000}$'`), but when built with -O3 very little overhead remains. I'm adding this comment because every time I see this it seems to me like it should have `EXPENSIVE_CHECKS` around it, but profiling is telling me it really doesn't matter.
Configuration menu - View commit details
-
Copy full SHA for cead0d9 - Browse repository at this point
Copy the full SHA cead0d9View commit details -
Configuration menu - View commit details
-
Copy full SHA for cbfeddd - Browse repository at this point
Copy the full SHA cbfedddView commit details -
Configuration menu - View commit details
-
Copy full SHA for c3dab77 - Browse repository at this point
Copy the full SHA c3dab77View commit details -
edgeset: Switch from linear to binary searching in edge_set_add_bulk.
This is a major hotspot when doing epsilon removal over large runs of potentially skipped states (as might appear from `^[ab]{0,2000}$`). Add a fast path for appending, which is also very common. Extract the edge set destination search into its own function, `find_state_position`, and add a `#define` to switch between linear search, binary search, or calling both and comparing the result. I will remove linear search in the next commit, but am checking this in as an intermediate step for checking & benchmarking.
Configuration menu - View commit details
-
Copy full SHA for 1ada07c - Browse repository at this point
Copy the full SHA 1ada07cView commit details -
edgeset: Commit to using binary search.
When I run `time ./re -rpcre -C '^[ab]{0,2000}$'` locally for -O3: - linear search: 2.991s - binary search: 1.521s
Configuration menu - View commit details
-
Copy full SHA for 7122d2f - Browse repository at this point
Copy the full SHA 7122d2fView commit details -
determinise: Drastically reduce calls to qsort.
After the other changes in this PR, calls to qsort from `sort_and_dedup_dst_buf` are one of the largest remaining hotspots in the profile. We can often avoid calling qsort, though: - If there is <= 1 entry, just return, it's sorted. - Otherwise, first do a sweep through the array noting the min and max values. Unless there is a huge range between them, it's much faster to build a bitset from them in a small (max 10KB) stack-allocated array and then unpack the bitset (now sorted and unique). Only the needed portion of the array is initialized. I have not done a lot of experimentation to find a cutoff point where the bitset becomes slower than qsort (it may be much larger), I picked 10KB because it's likely to be safe to stack-allocate. I tried changing the bitset unpacking to use an 8 or 16 bit mask and jump forward faster through large sub-word ranges of 0 bits, but any improvement was lost among random variation, so I decided it wasn't worth the extra complexity. We already skip whole words that are 0.
Configuration menu - View commit details
-
Copy full SHA for 937585a - Browse repository at this point
Copy the full SHA 937585aView commit details -
Configuration menu - View commit details
-
Copy full SHA for 30e34ef - Browse repository at this point
Copy the full SHA 30e34efView commit details -
Configuration menu - View commit details
-
Copy full SHA for cf6051f - Browse repository at this point
Copy the full SHA cf6051fView commit details -
Configuration menu - View commit details
-
Copy full SHA for c1e1282 - Browse repository at this point
Copy the full SHA c1e1282View commit details -
bugfix: The range is min..max inclusive, so add 1.
If min and max are exactly 64 states apart the upper value was getting silently dropped due to an incorrect `words` value here. One of the patterns in the PCRE suite triggers this: ./re -rpcre '(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))' "caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" This should match, but did not.
Configuration menu - View commit details
-
Copy full SHA for 6eff0f9 - Browse repository at this point
Copy the full SHA 6eff0f9View commit details -
Address a couple warnings from scan-build.
determinise: It's not possible to find a cached result in the hash table without allocating a to-set buffer first, so assert that it will be non-NULL. fsm_findmode: This should never be used on a state without edges. vm/v1.c and vm/v2.c: Free allocated return value on error.
Configuration menu - View commit details
-
Copy full SHA for 98ee906 - Browse repository at this point
Copy the full SHA 98ee906View commit details -
Configuration menu - View commit details
-
Copy full SHA for 51892e3 - Browse repository at this point
Copy the full SHA 51892e3View commit details -
Configuration menu - View commit details
-
Copy full SHA for 7c6644f - Browse repository at this point
Copy the full SHA 7c6644fView commit details -
Configuration menu - View commit details
-
Copy full SHA for c646868 - Browse repository at this point
Copy the full SHA c646868View commit details -
Configuration menu - View commit details
-
Copy full SHA for 0789d61 - Browse repository at this point
Copy the full SHA 0789d61View commit details
Commits on Sep 12, 2023
-
Configuration menu - View commit details
-
Copy full SHA for 1ca3726 - Browse repository at this point
Copy the full SHA 1ca3726View commit details
Commits on Sep 27, 2023
-
Merge branch 'main' into sv/integrate-combinable-DFA-capture-resolution
There were merge conflicts, which had to do with the return type of idmap_iter's callback -- on `main` it was `void`, on the branch it was `int` and indicated whether iterations should continue or halt. I picked the `int` form, and updated `idmap_iter` so that it was actually checked, because it probably doesn't make sense to continue iterating when earlier steps have errored out.
Configuration menu - View commit details
-
Copy full SHA for 5ece637 - Browse repository at this point
Copy the full SHA 5ece637View commit details
Commits on Oct 2, 2023
-
ci.yml: Update fuzzer modes for CI.
- 'd' (default) no longer exists, it's now 'r' - add 's', 'i', 'M'
Configuration menu - View commit details
-
Copy full SHA for 239927c - Browse repository at this point
Copy the full SHA 239927cView commit details