Performance improvements for epsilon removal and determinisation #441

Merged

Conversation


@silentbicycle (Collaborator) commented Aug 30, 2023

This PR improves performance for a couple of code paths in epsilon removal and determinisation that become hotspots with large `{}` repeated groups:

  • Add a fast path upfront for when state_set_search would return the position after the array (appending), since this is very common.
  • Switch from linear to binary searching in edge_set_add_bulk.
  • Avoid unnecessary calls to qsort: when the working space would be small (also very common), build a bitset on the stack instead.

I also experimented with reworking the epsilon_closure algorithm to avoid populating the closure with intermediate states on the transitive path to the epsilon closure endpoints. For epsilon removal we only need to add states to the closures that either have labeled edges or are end states; filtering out everything else (potentially including states appearing in their own epsilon closure) reduces the state set sizes for further processing, whereas otherwise those states only get filtered out in a later step. In the end, though, it had nowhere near as much impact as the other changes, and it makes the epsilon closure function less general-purpose, so it's probably not worth the extra complexity.

On my laptop, building with -O3, the difference between main and this PR's HEAD for `time ./re_main -rpcre -C '^[ab]{0,5000}$'`:

  • main: real 0m41.328s
  • HEAD: real 0m5.167s

I chose that regex for benchmarking because it produces a very long chain of epsilon closures and a big edge set pileup.

The description says "Return where an item would be, if it were
inserted", but it was returning the last element <= rather than
the first element >=, and the later call to `state_set_cmpval`
was shifting i by 1 for that specific case. Handle it correctly
inside the search function instead. The two other call sites
need to check whether the result refers to the append position
(one past the end of the array) before checking `set->a[i] == state`;
update them.
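
For illustration, a minimal sketch of the corrected contract -- a lower-bound binary search that returns the first position whose value is >= the target, which is the append position (one past the end) when the target exceeds every element. This is a hedged reconstruction, not libfsm's actual code; `lower_bound` is a placeholder name:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t fsm_state_t;   /* as in libfsm */

    /* Return the index where `state` would be inserted: the first
     * position whose value is >= state. If state is greater than
     * every element this is `count`, one past the end, so callers
     * must bounds-check before reading a[i]. */
    static size_t
    lower_bound(const fsm_state_t *a, size_t count, fsm_state_t state)
    {
        size_t lo = 0, hi = count;
        while (lo < hi) {
            const size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < state) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

A caller then tests `i < count && set->a[i] == state` for presence, rather than shifting `i` after the fact.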

Add a fast path upfront: it's VERY common to append states in order
to the state array, so before binary searching, first compare
against the last entry (unless the set is empty).
In -O0 this can become pretty expensive (~25% of overall runtime for
`time ./re -rpcre -C '^[ab]{0,2000}$'`), but when built with -O3 very
little overhead remains. I'm adding this comment because every time I
see this it seems to me like it should have `EXPENSIVE_CHECKS` around
it, but profiling is telling me it really doesn't matter.
This is a major hotspot when doing epsilon removal over large runs of
potentially skipped states (as might appear from `^[ab]{0,2000}$`).
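
A sketch of that fast path (again a hedged reconstruction, reusing the `lower_bound` placeholder above):

    /* It's very common to append states in ascending order, so check
     * against the last entry before falling back to binary search. */
    static size_t
    search_with_append_fast_path(const fsm_state_t *a, size_t count,
        fsm_state_t state, int *found)
    {
        if (count == 0 || a[count - 1] < state) {
            *found = 0;
            return count;       /* append position: one past the end */
        }
        const size_t i = lower_bound(a, count, state);
        *found = i < count && a[i] == state;
        return i;
    }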

Add a fast path for appending, which is also very common.

Extract the edge set destination search into its own function,
`find_state_position`, and add a `#define` to switch between linear
search, binary search, or calling both and comparing the result.
I will remove linear search in the next commit, but am committing
this as an intermediate step for verification & benchmarking.
When I run `time ./re -rpcre -C '^[ab]{0,2000}$'` locally for -O3:
- linear search: 2.991s
- binary search: 1.521s
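
For reference, a sketch of that `#define` switch (the macro and helper names here are placeholders; the PR's actual names may differ, and `enum fsp_res` and `struct edge_set` are as declared in the surrounding file):

    #include <assert.h>
    #include <stddef.h>

    #define SEARCH_LINEAR   0
    #define SEARCH_BINARY   1
    #define SEARCH_CMP_BOTH 2   /* transitional: run both, assert agreement */

    #ifndef FSP_SEARCH_MODE
    #define FSP_SEARCH_MODE SEARCH_CMP_BOTH
    #endif

    static enum fsp_res fsp_linear(const struct edge_set *set, fsm_state_t state, size_t *dst);
    static enum fsp_res fsp_binary(const struct edge_set *set, fsm_state_t state, size_t *dst);

    static enum fsp_res
    find_state_position(const struct edge_set *set, fsm_state_t state, size_t *dst)
    {
    #if FSP_SEARCH_MODE == SEARCH_LINEAR
        return fsp_linear(set, state, dst);
    #elif FSP_SEARCH_MODE == SEARCH_BINARY
        return fsp_binary(set, state, dst);
    #else
        /* run both searches and check that they agree */
        size_t d_lin, d_bin;
        const enum fsp_res r_lin = fsp_linear(set, state, &d_lin);
        const enum fsp_res r_bin = fsp_binary(set, state, &d_bin);
        assert(r_lin == r_bin && d_lin == d_bin);
        *dst = d_bin;
        return r_bin;
    #endif
    }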
After the other changes in this PR, calls to qsort from
`sort_and_dedup_dst_buf` are one of the largest remaining hotspots in
the profile. We can often avoid calling qsort, though:

- If there is at most one entry, just return; it's already sorted.

- Otherwise, first do a sweep through the array noting the min and max
  values. Unless there is a huge range between them, it's much faster to
  build a bitset from them in a small (max 10KB) stack-allocated array
  and then unpack the bitset (now sorted and unique). Only the needed
  portion of the array is initialized.

I have not done a lot of experimentation to find the cutoff point where
the bitset becomes slower than qsort (it may be much larger); I picked
10KB because it's likely to be safe to stack-allocate.

I tried changing the bitset unpacking to use an 8 or 16 bit mask and
jump forward faster through large sub-word ranges of 0 bits, but any
improvement was lost among random variation, so I decided it wasn't
worth the extra complexity. We already skip whole words that are 0.
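
Putting the pieces together, a self-contained sketch of the approach described above (hedged: the names, the exact cutoff check, and the fallback dedup are illustrative, not the PR's code):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef uint32_t fsm_state_t;   /* as in libfsm */

    static int
    cmp_state(const void *pa, const void *pb)
    {
        const fsm_state_t a = *(const fsm_state_t *)pa;
        const fsm_state_t b = *(const fsm_state_t *)pb;
        return a < b ? -1 : a > b;
    }

    /* Sort and dedup in place, returning the new count. */
    static size_t
    sort_and_dedup(fsm_state_t *a, size_t count)
    {
        uint64_t bits[10 * 1024 / 8];   /* 10KB stack bitset */

        if (count <= 1) {
            return count;               /* already sorted */
        }

        /* sweep for min and max */
        fsm_state_t min = a[0], max = a[0];
        for (size_t i = 1; i < count; i++) {
            if (a[i] < min) { min = a[i]; }
            if (a[i] > max) { max = a[i]; }
        }

        /* bit offsets 0..range inclusive need range/64 + 1 words */
        const uint64_t range = (uint64_t)max - min;
        const size_t words = (size_t)(range / 64) + 1;

        if (words > sizeof bits / sizeof bits[0]) {
            /* huge range: fall back to qsort, then dedup in place */
            qsort(a, count, sizeof *a, cmp_state);
            size_t used = 1;
            for (size_t i = 1; i < count; i++) {
                if (a[i] != a[used - 1]) { a[used++] = a[i]; }
            }
            return used;
        }

        /* only initialize the needed portion of the array */
        memset(bits, 0x00, words * sizeof bits[0]);
        for (size_t i = 0; i < count; i++) {
            const uint64_t off = a[i] - min;
            bits[off / 64] |= (uint64_t)1 << (off & 63);
        }

        /* unpack: now sorted and unique; whole zero words are skipped */
        size_t used = 0;
        for (size_t w = 0; w < words; w++) {
            if (bits[w] == 0) { continue; }
            for (unsigned b = 0; b < 64; b++) {
                if (bits[w] & ((uint64_t)1 << b)) {
                    a[used++] = min + (fsm_state_t)(w * 64 + b);
                }
            }
        }
        return used;
    }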
@silentbicycle (Collaborator, Author) commented Aug 30, 2023

Along with a couple of unrelated UBSan warnings (fixed above to reduce noise), CI found a bug -- `re -rpcre '(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))'` should match "caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" but doesn't -- so I'm investigating that.

If min and max are exactly 64 states apart, the upper value was
silently dropped due to an incorrect `words` value here.

One of the patterns in the PCRE suite triggers this:

    ./re -rpcre '(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))' "caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"

This should match, but did not.
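
A plausible reconstruction of the off-by-one (the commit's exact expression may differ): the bitset must hold bit offsets 0..range inclusive, so the word count is

    /* correct: offsets 0..range inclusive span range/64 + 1 words */
    const size_t words = (size_t)(range / 64) + 1;

whereas a round-up like `(range + 63) / 64` yields one word too few exactly when `range` is a multiple of 64 -- e.g. range == 64 gives 1 word, silently dropping bit 64.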
@silentbicycle (Collaborator, Author) commented Aug 31, 2023

Note: We can't use bsearch in state_set_search and edge_set_add_bulk because its interface doesn't match our use case -- bsearch returns NULL if the value isn't present, whereas we need the offset where the value would go if not present.

I tried calling out to bsearch anyway (and tracking time but ignoring the result) to compare timing, and bsearch took about 3.5x as long.
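
To make the mismatch concrete, a small sketch (reusing the `cmp_state` comparator from the earlier sketch):

    #include <stdlib.h>

    /* bsearch(3) only reports presence: when the key is absent it
     * returns NULL, with no way to recover where the key would go. */
    static int
    contains(const fsm_state_t *a, size_t count, fsm_state_t key)
    {
        return bsearch(&key, a, count, sizeof *a, cmp_state) != NULL;
        /* on a miss, the insertion offset is still unknown, so a
         * second (lower-bound) search would be needed anyway */
    }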

@katef (Owner) commented on `find_state_position`:

    * which includes the position immediately following the last entry. Return an enum
    * which indicates whether state is already present. */
    static enum fsp_res
    find_state_position(const struct edge_set *set, fsm_state_t state, size_t *dst)

Should we use this in `edge_set_find` and `edge_set_contains` too?

@silentbicycle (Collaborator, Author) replied:

We can binary search here because the struct edge_group array is sorted by .to and we're searching by destination state, but `edge_set_find` and `edge_set_contains` search by edge label. If they were frequently called and became a hotspot in the profile, we could do linear-time reindexing and bsearch on those, but currently they don't even show up in the profile. As far as I can tell they're only called from inside fsm_walk2 and the minimisation test oracle.
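
For context, the layout that makes this work looks roughly like the following (a sketch from my reading; libfsm's actual struct may differ in its details):

    /* The edge_group array is kept sorted by .to, so a binary search
     * keyed on the destination state is valid; label lookups have no
     * such ordering to exploit. */
    struct edge_group {
        fsm_state_t to;         /* sort key: destination state */
        uint64_t symbols[4];    /* 256-bit label bitmap */
    };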

@katef merged commit 6eff0f9 into main on Sep 11, 2023
322 checks passed
@katef deleted the sv/performance-improvements-for-epsilon-removal-and-determinisation branch on September 11, 2023 at 16:13
@silentbicycle (Collaborator, Author) commented:
The bug I mentioned above (with `(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))`) is fixed by the commit immediately after (a58cab2). At the time a comment seemed redundant, but I'm noting it here just in case.
