CDATA codegen updates, part 3 #33

silentbicycle · 2024-12-09T17:56:31Z

Some further updates, for a new Device Detection canary. This still isn't quite ready to go upstream, but it's getting closer.

Changes:

Move the .end and .endid_offset fields from the state struct table to a separate table. Those fields aren't used until the end of input, so moving them out makes the state table denser and improves locality.
Intern individual words from each state's 256-bit bitsets. These words tend to be duplicated across several states, so if each DFA stores a bitset as four uint8_t or uint16_t offsets into a shared word table, the overall binary ends up considerably smaller. The table is sorted by use (descending), so the most frequently used words are likely to stay in cache.
Fix a memory leak. This is already merged upstream but has not been synced here: "Fix a memory leak during fsm_determinise_with_config's early exit", katef/libfsm#504.
Buffer eager_outputs until a successful match, rather than immediately writing them into the caller's buffer.

These fields are only used at the end of input, and moving them into a different struct will make the per-state data accessed during DFA execution more compact.
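As a rough illustration of the hot/cold split described above, here is a minimal sketch. It is not the generated code: the struct names, the `next` field, and the two-symbol alphabet are all hypothetical, chosen only to show that the per-byte loop never touches `.end` or `.endid_offset`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_COUNT 2

/* Hot table (hypothetical layout): read once per input byte. */
struct hot_state {
    uint8_t next[2];    /* next state for 'a' (index 0) and 'b' (index 1) */
};

/* Cold table: only read after the last input byte. */
struct end_info {
    bool end;
    uint16_t endid_offset;
};

/* Toy DFA accepting strings over {a,b} that end in 'b'. */
static const struct hot_state hot_states[STATE_COUNT] = {
    { { 0, 1 } },
    { { 0, 1 } },
};

static const struct end_info end_infos[STATE_COUNT] = {
    { false, 0 },
    { true, 7 },    /* 7 is an arbitrary illustrative offset */
};

static bool
run_dfa(const char *input, size_t len, uint16_t *endid_offset)
{
    uint8_t state = 0;

    /* Inner loop: only hot_states is accessed here. */
    for (size_t i = 0; i < len; i++) {
        if (input[i] != 'a' && input[i] != 'b') { return false; }
        state = hot_states[state].next[input[i] == 'b'];
    }

    /* End of input: a single lookup in the cold table. */
    assert(state < STATE_COUNT);
    const struct end_info *ei = &end_infos[state];
    if (!ei->end) { return false; }
    *endid_offset = ei->endid_offset;
    return true;
}
```

With this layout the array walked during matching holds only transition data, so more states fit per cache line.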

Each state has two 256-bit bitsets, each stored as a uint64_t[4], but the individual words in those have a lot of duplication. Add a table with every unique word, sorted descending by frequency, and replace the per-state labels and label_group_starts arrays with an array of offsets into the label_word table. Typically these offsets will fit in a uint8_t (though the code generation will switch to a uint16_t when necessary), making the per-state data much smaller. The label_word table's most commonly used entries are all grouped together and should stay in cache.
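The interning step could be sketched roughly as follows. This is an assumption-laden illustration rather than the actual codegen: `intern_words`, `label_bit_set`, and `MAX_UNIQUE` are made-up names, the duplicate search is a simple linear scan, and the sketch gives up instead of switching to uint16_t offsets when there are more than 256 unique words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_UNIQUE 256  /* offsets 0..255 fit in a uint8_t */

struct word_count {
    uint64_t word;
    size_t count;
};

/* Sort by use count, descending; break ties by word value so the
 * output is deterministic. */
static int
cmp_count_desc(const void *pa, const void *pb)
{
    const struct word_count *a = pa, *b = pb;
    if (a->count != b->count) { return a->count < b->count ? 1 : -1; }
    return (a->word > b->word) - (a->word < b->word);
}

/* Build a frequency-sorted table of the unique words in words[] and
 * write each word's table offset into offsets[]. table[] must have
 * room for MAX_UNIQUE entries. Returns the number of unique words,
 * or 0 if there are too many for uint8_t offsets. */
static size_t
intern_words(const uint64_t *words, size_t nwords,
    uint64_t *table, uint8_t *offsets)
{
    struct word_count uniq[MAX_UNIQUE];
    size_t n = 0;

    for (size_t i = 0; i < nwords; i++) {
        size_t j = 0;
        while (j < n && uniq[j].word != words[i]) { j++; }
        if (j == n) {
            if (n == MAX_UNIQUE) { return 0; }
            uniq[n].word = words[i];
            uniq[n].count = 0;
            n++;
        }
        uniq[j].count++;
    }

    qsort(uniq, n, sizeof(uniq[0]), cmp_count_desc);
    for (size_t j = 0; j < n; j++) { table[j] = uniq[j].word; }

    /* Every word is in the table by now, so this search terminates. */
    for (size_t i = 0; i < nwords; i++) {
        size_t j = 0;
        while (table[j] != words[i]) { j++; }
        offsets[i] = (uint8_t)j;
    }
    return n;
}

/* Test one bit of a 256-bit set stored as four offsets into table[]. */
static bool
label_bit_set(const uint64_t *table, const uint8_t *state_offsets,
    uint8_t byte)
{
    const uint64_t word = table[state_offsets[byte / 64]];
    return (word >> (byte % 64)) & 1;
}
```

In this sketch a bitset shrinks from 32 bytes to 4 bytes of offsets, at the cost of one extra indirection per lookup.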

The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.

Previously, the CDATA codegen wrote eager_outputs directly into the caller's match bit buffer as they were encountered. Instead, set them in a stack-allocated buffer, and then copy them to the caller's buffer if the DFA match succeeds overall. To avoid repeatedly checking whether an eager_output has already been set in the buffer, this collects the set of all distinct eager_output IDs and then remaps the array of eager output IDs to offsets into that unique set. This condenses the (sparse) set into a dense series 0..n that can be represented by flags in a stack-allocated bit vector (with a size known at compile time), and redundant eager_outputs harmlessly set flag bits that are already set. If the overall match succeeds, that bit vector is matched up with the unique ID array and the sparse values are written into the caller's buffer. Because the unique ID array is sorted, the relative ordering of the sparse and dense IDs is preserved (and 0 stays 0), so using non-ascending values as terminators still works.
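A minimal sketch of the buffering scheme described above, under several assumptions: the ID values, `EAGER_OUTPUT_COUNT`, and `commit_eager_outputs` are hypothetical, and DFA execution is simulated by passing in the list of dense IDs that were hit, rather than by running a real state machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sorted array of distinct (sparse) eager output IDs. The values are
 * made up; a dense ID is simply an index into this array. */
#define EAGER_OUTPUT_COUNT 3
static const uint16_t eager_output_ids[EAGER_OUTPUT_COUNT] = { 3, 17, 70 };

/* dense_hits[] stands in for the dense IDs encountered while the DFA
 * runs. caller_bits[] is only modified if the match succeeded. */
static bool
commit_eager_outputs(const uint8_t *dense_hits, size_t nhits,
    bool matched, uint64_t *caller_bits)
{
    /* Stack-allocated bit vector; its size is known at compile time. */
    uint64_t pending[(EAGER_OUTPUT_COUNT + 63) / 64] = { 0 };

    for (size_t i = 0; i < nhits; i++) {
        const uint8_t d = dense_hits[i];
        assert(d < EAGER_OUTPUT_COUNT);
        /* Setting an already-set bit is harmless, so no duplicate check. */
        pending[d / 64] |= (uint64_t)1 << (d % 64);
    }

    if (!matched) { return false; }     /* caller's buffer untouched */

    /* Map each dense flag back to its sparse ID in the caller's buffer. */
    for (size_t d = 0; d < EAGER_OUTPUT_COUNT; d++) {
        if (pending[d / 64] & ((uint64_t)1 << (d % 64))) {
            const uint16_t id = eager_output_ids[d];
            caller_bits[id / 64] |= (uint64_t)1 << (id % 64);
        }
    }
    return true;
}
```

The point of the dense remapping is that the pending buffer only needs one bit per distinct eager output, however large the sparse IDs are.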

deg4uss3r

Admittedly, I am coming into the code base pretty cold and C isn't my preferred language, but I do not see anything here that would prevent this from being approved.

silentbicycle added 4 commits December 9, 2024 12:12

cdata: Move .end and (optional) .endid_offset into a separate array.

ed2e4d0

These fields are only used at the end of input, and moving them into a different struct will make the per-state data accessed during DFA execution more compact.

Fix memory leak.

af0ae65

The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.

silentbicycle requested review from cxreg and katef December 9, 2024 17:56

deg4uss3r approved these changes Dec 11, 2024
