The most efficient way to search multiple texts with long common prefixes #1199

Dan-wanna-M · 2024-06-05T04:54:33Z

Basically, I need to find the leftmost match with untrusted regular expressions on multiple texts that share long common prefixes. The simplest method is to run the search on each text individually, which unfortunately means repeatedly searching the common prefix and adds significant overhead in my use case. I checked regex-automata and the fully compiled DFA looks great, since it allows me to resume a search in the middle with a cached state ID. The warning on exponential memory and time complexity is definitely worrying though. I would like to know whether there is a better approach. Thank you for spending time reading this discussion!

Answered by BurntSushi

BurntSushi · 2024-06-05T12:07:58Z

BurntSushi
Jun 5, 2024
Maintainer

The full DFA is definitely ideal here in the sense that you can stop and start it in arbitrary states. So I can see how that would help with the longest common prefix here. And the full DFA is really the only engine capable of that. It might be possible to do the same with the PikeVM if you could abandon captures, but a "state" in that case wouldn't just be a single state ID, but an ordered set of state IDs. And of course, the PikeVM is quite a bit slower than the full DFA.
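To make the "stop and start in arbitrary states" idea concrete, here is a minimal sketch using a hand-built toy DFA rather than regex-automata itself. Everything in it (`StateId`, `next_state`, `run`, `search_with_shared_prefix`) is hypothetical and only mirrors the general shape of stepping a DFA one byte at a time, similar in spirit to `Automaton::next_state` in regex-automata. The toy automaton reports whether `abc` occurs anywhere in the text. A real leftmost search would also need to track match offsets and handle the end-of-input transition, which this sketch leaves out.

```rust
/// Identifier of a state in the toy DFA. The whole search state is this one value.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct StateId(u8);

/// State 0: no progress, 1: saw "a", 2: saw "ab", 3: saw "abc" (sticky match).
const START: StateId = StateId(0);

/// Hand-written transition function for "the text contains abc".
fn next_state(current: StateId, byte: u8) -> StateId {
    match (current.0, byte) {
        (3, _) => StateId(3),     // once matched, stay matched
        (_, b'a') => StateId(1),  // an 'a' always (re)starts a candidate
        (1, b'b') => StateId(2),
        (2, b'c') => StateId(3),
        _ => StateId(0),          // anything else loses all progress
    }
}

fn is_match(state: StateId) -> bool {
    state.0 == 3
}

/// Feed `bytes` to the DFA starting from `state` and return the state reached.
fn run(mut state: StateId, bytes: &[u8]) -> StateId {
    for &b in bytes {
        state = next_state(state, b);
    }
    state
}

/// Scan the shared prefix once, cache the state it ends in, and resume
/// from that cached state for every suffix.
fn search_with_shared_prefix(prefix: &str, suffixes: &[&str]) -> Vec<bool> {
    let cached = run(START, prefix.as_bytes());
    suffixes
        .iter()
        .map(|s| is_match(run(cached, s.as_bytes())))
        .collect()
}
```

For example, `search_with_shared_prefix("xxab", &["c", "d", "abc"])` scans `xxab` only once and returns `[true, false, true]`. The point is that the cached value is a single small ID, which is what makes the full DFA attractive here; with a PikeVM-style simulation the cached value would have to be an ordered set of NFA states instead.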

With that said... you said the magic words here:

> untrusted regular expressions

And then here:

> The warning on exponential memory and time complexity is definitely worrying though.

Yes indeed. You should be worried. If the regexes are truly untrusted, then building a full DFA from them is a really really bad idea. It wouldn't take much to DoS you. The docs even have examples of very short regexes that will use exponential time. And of course, you'll have to worry about memory exhaustion too.
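As an illustration of how short a pattern can be and still blow up, the sketch below runs a textbook subset construction on the NFA for `(a|b)*a(a|b){n}` and counts the resulting DFA states. This is a classic family and is an assumption on my part, not necessarily the example the docs use, and the code is a from-scratch toy (`dfa_state_count` is a hypothetical name), not how regex-automata determinizes. NFA states are numbered `0..=n+1` and each subset is stored as a bitmask.

```rust
use std::collections::{HashSet, VecDeque};

/// Counts the DFA states reachable by subset construction for the NFA of
/// `(a|b)*a(a|b){n}` over the alphabet {a, b}.
///
/// NFA layout (an assumption of this sketch):
///   state 0      loops on both symbols and also goes to state 1 on 'a'
///   state i      (1 <= i <= n) goes to state i + 1 on either symbol
///   state n + 1  is accepting and has no outgoing transitions
fn dfa_state_count(n: u32) -> usize {
    assert!(n < 62, "the bitmask representation only supports small n");

    // Advance a set of NFA states (as a bitmask) over one symbol.
    let step = |set: u64, is_a: bool| -> u64 {
        let mut next = 0u64;
        for s in 0..=(n + 1) {
            if set & (1u64 << s) == 0 {
                continue;
            }
            if s == 0 {
                next |= 1; // self loop
                if is_a {
                    next |= 1u64 << 1; // 0 --a--> 1
                }
            } else if s <= n {
                next |= 1u64 << (s + 1); // s --a|b--> s + 1
            }
        }
        next
    };

    // Breadth-first exploration of every reachable subset, starting at {0}.
    let mut seen: HashSet<u64> = HashSet::new();
    let mut queue: VecDeque<u64> = VecDeque::new();
    seen.insert(1);
    queue.push_back(1);
    while let Some(set) = queue.pop_front() {
        for is_a in [false, true] {
            let next = step(set, is_a);
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}
```

The count doubles with every increment of `n`: `dfa_state_count(1)` is 4, `dfa_state_count(3)` is 16 and `dfa_state_count(10)` is 2048. The DFA has to remember the last `n + 1` symbols, so an attacker only needs to pick a moderately large `n` to exhaust memory and time.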

> I would like to know whether there is a better approach.

Dunno really. Given what you've told me as hard constraints, I'd probably just sacrifice the perf, depending on how much it is. It'd still be correct but maybe not the fastest possible thing. If every ounce of perf is critical, then you might be in "bespoke regex engine" land. And I'm not really sure where I'd start.

The "untrusted regex" part is really the killer here. Untrusted regexes are usually a very, very bad idea. I recognize they are occasionally useful, but you need to be really careful. One possible idea here is to compile the untrusted regexes to DFAs in a carefully controlled sandbox with modest resource limits. If you can do that, then it's plausible your approach can work.

2 replies

Dan-wanna-M Jun 5, 2024
Author

I see. By the way, since the fully compiled DFAs have configs that limit their memory usage, can't I rely on them to avoid DoS? Or do there exist patterns that have exponential time complexity but linear space complexity?

BurntSushi Jun 5, 2024
Maintainer

That's a good point. Yes, you can rely on those limits when properly configured, to an extent. At least, that is the intent. The meta regex engine does exactly that. You just might be surprised at how limiting it is. You'll get a lot more mileage out of it if you don't need to support Unicode. Note though that the meta regex engine also uses heuristic limits on the size of the NFA. If the NFA is too big, then it won't even attempt DFA construction. The reasoning is that you can generally expect a DFA to be at least as big as an NFA.

As for exponential time complexity and linear space complexity... I'm not sure to be honest. I can't think of an example off the top of my head. My intuition is that it isn't possible. That is, to get to exponential time you will also need to use exponential space. But I think that's tricky to prove. Namely, DFA determinization does make use of a cache of states for reuse. So in theory you could wind up with an exponential number of cache hits, but small overall space usage.
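One way to hedge against both concerns at once is to bound the construction by the number of distinct states (space) and by the number of transitions computed (time), and to abort as soon as either budget runs out. The sketch below does that for a toy subset construction over the NFA of `(a|b)*a(a|b){n}`. It is not how regex-automata implements its limits; in the real crate the relevant knobs are, as far as I know, `dense::Config::dfa_size_limit` and `dense::Config::determinize_size_limit`, which bound memory rather than steps. `Budget`, `BuildError`, `step` and `determinize_bounded` are hypothetical names made up for this illustration.

```rust
use std::collections::{HashSet, VecDeque};

#[derive(Debug, PartialEq, Eq)]
enum BuildError {
    TooManyStates,
    TooManySteps,
}

/// Separate budgets for space (distinct DFA states) and time (transitions computed).
struct Budget {
    max_states: usize,
    max_steps: usize,
}

/// Advance a bitmask of NFA states for `(a|b)*a(a|b){n}` over one symbol.
/// State 0 loops and goes to 1 on 'a'; states 1..=n advance on any symbol;
/// state n + 1 is accepting with no outgoing transitions.
fn step(set: u64, is_a: bool, n: u32) -> u64 {
    let mut next = 0u64;
    for s in 0..=(n + 1) {
        if set & (1u64 << s) == 0 {
            continue;
        }
        if s == 0 {
            next |= 1;
            if is_a {
                next |= 1u64 << 1;
            }
        } else if s <= n {
            next |= 1u64 << (s + 1);
        }
    }
    next
}

/// Subset construction that gives up once either budget is exceeded.
/// Returns the number of DFA states on success.
fn determinize_bounded(n: u32, budget: &Budget) -> Result<usize, BuildError> {
    assert!(n < 62, "the bitmask representation only supports small n");
    let mut seen: HashSet<u64> = HashSet::new();
    let mut queue: VecDeque<u64> = VecDeque::new();
    seen.insert(1);
    queue.push_back(1);
    let mut steps = 0usize;
    while let Some(set) = queue.pop_front() {
        for is_a in [false, true] {
            // Time budget: counts every transition, including cache hits.
            steps += 1;
            if steps > budget.max_steps {
                return Err(BuildError::TooManySteps);
            }
            let next = step(set, is_a, n);
            if seen.insert(next) {
                // Space budget: counts only newly discovered states.
                if seen.len() > budget.max_states {
                    return Err(BuildError::TooManyStates);
                }
                queue.push_back(next);
            }
        }
    }
    Ok(seen.len())
}
```

With `n = 3` and generous budgets this returns `Ok(16)`; with `n = 20` and `max_states: 1000` it stops early with `Err(BuildError::TooManyStates)`. In this toy the step count is at most twice the state count, so the two budgets are tightly coupled. The step budget only earns its keep if a construction can revisit cached states far more often than it creates new ones, which is exactly the open question raised above.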
