Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDATA codegen updates, part 3 #33

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

silentbicycle
Copy link

Some further updates, for a new Device Detection canary. This still isn't quite ready to go upstream, but it's getting closer.

Changes:

  • Move .end and .endid_offset fields from the state struct table to a separate table. Those fields aren't used until the end of input, moving them out makes the state table more dense and improves locality.
  • Intern individual words from each state's 256-bitsets: These tend to be duplicated across several states, so if DFAs store them via four uint8_t or uint16_t offsets into a shared word table the overall binary ends up considerably smaller. They are sorted by use (descending), so the most frequently used ones are likely to stay in cache.
  • Fix a memory leak -- already merged upstream, but has not been synced here: Fix a memory leak during fsm_determinise_with_config's early exit katef/libfsm#504
  • Buffer eager_outputs until a successful match, rather than immediately writing them into the caller's buffer.

These fields are only used at the end of input, and moving them into a
different struct will make the per-state data accessed during DFA
execution more compact.
Each state has two 256-bitsets, stored as a uint64_t[4], but the
individual words in those have a lot of duplication. Add a table
with every unique word, sorted descending by frequency, and replace
the per-state labels and label_group_starts arrays with an array of
offsets into the label_word table. Typically these offsets will
fit in a uint8_t (though the code generation will switch to
a uint16_t when necessary), making the per-state data much smaller.
The label_word table's most commonly used entries are all grouped
together and should stay in cache.
The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.
Previously, the CDATA codegen wrote eager_outputs directly into the
caller's match bit buffer as they were encountered. Instead, set them in
a stack-allocated buffer, and then copy them to the caller's if the DFA
match succeeds overall.

In order to avoid repeatedly checking for whether an eager_output has
already been set (in the buffer), this collects the set of all distinct
eager_output IDs and then remaps the array with eager output IDs to
offsets into the unique set. This condenses the (sparse) set
into a dense series 0..n that can be represented by flags in a
stack-allocated bit vector (with a size known at compile time), and
redundant eager_outputs harmlessly set flag bits that are already set.
If the overall match succeeds, that bit vector is matched up with the
unique ID array and the sparse values are written into the caller's
buffer.

Because the unique ID array is sorted, the relative ordering of the
sparse and dense IDs is preserved (and 0 stays 0), so using
non-ascending values as terminators still works.
@silentbicycle silentbicycle requested review from cxreg and katef December 9, 2024 17:56
Copy link

@deg4uss3r deg4uss3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly I am coming into the code base pretty cold and C isn't my most preferred language but I do not see anything here that would prevent this from being approved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants