Feature proposal: find lexicographic bounds for an FSM #316

Udzu · 2021-02-16T15:46:52Z

Some data storage systems store records in name order, making it efficient to retrieve records with names in a given range. When filtering such records by regular expression, it might therefore be useful to have a way to generate a range in which all the results are guaranteed lie, to use as a pre-filter.

For example, any match of the regular expression (moomin|troll)+ must lie in the range ("moomin", "trolltrolm"). Note that the last character of the upper bound has been 'rounded up' to ensure that every match, no matter how long, lies before it.

This should be fairly straightforward to implement for a DFA, and only slightly more complex for a NFA (assuming it's been trimmed of any states that do not lead to a match). The interface will need to specify the maximum length of the bounds. The ordering can be lexicographic order on the byte encodings, which in the case of UTF8 also conveniently matches the ordering on Unicode code points. Still, there are a couple of potential UTF8 niggles:

the maximum length might lie inside a multi-byte character
rounding a truncated upper bound to the next byte might result in an invalid byte sequence

The easiest approach is probably to remain encoding-agnostic and simply ignore this, returning byte sequence bounds and letting the client library deal with converting those to valid strings.

The text was updated successfully, but these errors were encountered:

katef · 2021-03-04T02:45:35Z

This is really interesting, I'm not familiar with the theory here. Can you explain how you find the string trolltrolm from the minimal DFA for your example please?

katef added the enhancement label Aug 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature proposal: find lexicographic bounds for an FSM #316

Feature proposal: find lexicographic bounds for an FSM #316

Udzu commented Feb 16, 2021

katef commented Mar 4, 2021

Feature proposal: find lexicographic bounds for an FSM #316

Feature proposal: find lexicographic bounds for an FSM #316

Comments

Udzu commented Feb 16, 2021

katef commented Mar 4, 2021