The most efficient way to search multiple texts with long common prefixs #1199
-
Basically, I need to find leftmost match with untrusted regular expressions on multiple texts with long common prefixes. The simplest method is to run the search for each individual text, which unfortunately means repeatedly searching the common prefix and significant overhead in my use case. I checked |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 2 replies
-
The full DFA is definitely ideal here in the sense that you can stop and start it in arbitrary states. So I can see how that would help with the longest common prefix here. And the full DFA is really the only engine capable of that. It might be possible to do the same with the PikeVM if you could abandon captures, but a "state" in that case wouldn't just be a single state ID, but an ordered set of state IDs. And of course, the PikeVM is quite a bit slower than the full DFA. With that said... you said the magic words here:
And then here:
Yes indeed. You should be worried. If the regexes are truly untrusted, then building a full DFA from them is a really really bad idea. It wouldn't take much to DoS you. The docs even have examples of very short regexes that will use exponential time. And of course, you'll have to worry about memory exhaustion too.
Dunno really. Given what you've told me as hard constraints, I'd probably just sacrifice the perf, depending on how much it is. It'd still be correct but maybe not the fastest possible thing. If every ounce of perf is critical, then you might be in "bespoke regex engine" land. And I'm not really sure where I'd start. The "untrusted regex" part is really the killer here. Untrusted regexes are usually are very very bad idea. I recognize they are occasionally useful, but you need to be really careful. One possible idea here is to compile the untrusted regexes to DFAs in a carefully controlled sandbox with modest resource limits. If you can do that, then it's plausible your approach can work. |
Beta Was this translation helpful? Give feedback.
The full DFA is definitely ideal here in the sense that you can stop and start it in arbitrary states. So I can see how that would help with the longest common prefix here. And the full DFA is really the only engine capable of that. It might be possible to do the same with the PikeVM if you could abandon captures, but a "state" in that case wouldn't just be a single state ID, but an ordered set of state IDs. And of course, the PikeVM is quite a bit slower than the full DFA.
With that said... you said the magic words here:
And then here:
Yes indeed. You should be worried. …