Define consistent upper bound for hybrid DFA #1089
-
I'm currently rewriting my streaming regex search from a dense DFA to a hybrid DFA, since a dense DFA can lead to excessive memory usage with certain patterns. On the memory side, I can set an NFA size limit and a cache limit and I never run into any memory problems, so that's all good (see the sketch at the end of this post for roughly what I mean).

But on the runtime side I'm not entirely sure what the "best practice" would be. Of course I could use a timer to determine how long a search has taken, but I was hoping to avoid that by limiting the number of cache resets instead. This works pretty well: "complex" regexes that take a long time to search reliably exceed the reset count, while simpler regexes searching a haystack of the same size still work.

However, the reset count is persistent, so that leaves me only two options: either I reset it before every search and lose the ability to cache the compiled regex to avoid reconstruction across multiple searches, or I reset it on error and users will inevitably run into this error after using any regex that resets the cache more than once.

So I guess this boils down to two questions:
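For reference, the memory-side setup I mean looks roughly like the sketch below. This assumes regex-automata 0.4; the helper name and the concrete limits are placeholders of my own, not anything prescribed by the library.

```rust
use regex_automata::{hybrid::dfa::DFA, nfa::thompson};

/// Build a lazy DFA whose memory use is bounded no matter the pattern:
/// the NFA size and the lazy DFA's state cache are both capped.
/// (Hypothetical helper; the limits below are placeholder values.)
fn bounded_dfa(pattern: &str) -> Result<DFA, Box<dyn std::error::Error>> {
    let dfa = DFA::builder()
        // Cap the lazy DFA's transition table cache at 10 MiB.
        .configure(DFA::config().cache_capacity(10 * (1 << 20)))
        // Refuse to compile patterns whose NFA exceeds ~1 MiB.
        .thompson(thompson::Config::new().nfa_size_limit(Some(1 << 20)))
        .build(pattern)?;
    Ok(dfa)
}
```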
-
It might help to look at the configuration that a meta::Regex uses: regex/regex-automata/src/meta/wrappers.rs, lines 581 to 599 at 061ee81.

Basically, it uses a minimum cache clear count and a minimum number of bytes processed per state. The second knob is important because a cache clearing on its own doesn't mean the search will be slow: if the cache is only cleared occasionally, then the lazy DFA is probably still faster than the PikeVM (for example). Of course, it's just a heuristic, so it can be wrong. But in my experience, the combination of a minimum cache clear count and a minimum bytes per state is usually better in practice than either one on its own.

With that out of the way, I'm having a little trouble parsing your question. Why do you want to reset the count to zero at the start of every search? The count is persistent because of an assumption that if the lazy DFA is slow on one haystack, then it will likely be slow on the next haystack. I'm open to exposing a method just for clearing the cache clear count and nothing else. Would that solve your problem? I just want to make sure I understand the use case so that I can tell whether the method is well motivated.
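In code, setting those two knobs on a standalone hybrid DFA looks roughly like the following. This is a sketch rather than the literal wrappers.rs configuration, and the specific values are illustrative:

```rust
use regex_automata::hybrid::dfa::DFA;

/// Build a lazy DFA that gives up when it stops making progress: once
/// the cache has been cleared at least 3 times AND the search has
/// averaged fewer than 10 bytes of haystack per cached state, searches
/// fail with a "gave up" MatchError instead of thrashing the cache.
/// (Hypothetical helper; the values are illustrative.)
fn heuristic_dfa(pattern: &str) -> Result<DFA, Box<dyn std::error::Error>> {
    let dfa = DFA::builder()
        .configure(
            DFA::config()
                .minimum_cache_clear_count(Some(3))
                .minimum_bytes_per_state(Some(10)),
        )
        .build(pattern)?;
    Ok(dfa)
}
```

The point of the second knob is exactly the one above: clearing alone is fine as long as each filled cache still buys a lot of scanned bytes.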
-
I've read this in the documentation, but that configuration is specifically concerned with efficiency, right? Its purpose is to detect when to switch to a different regex implementation. My goal, though, is to determine when I should stop the search entirely.
The problem in my case is the complexity of the original regex, not the particular haystack. But maybe I'm misunderstanding the way the cache resets work. I'd imagine that with a limited memory size (which I have), certain patterns are inevitably going to run into cache resets, and a single cache reset isn't necessarily going to mean that I should give up already. And if I run into a cache reset once, running the same search again will produce another reset (since the entire DFA cannot be cached for this haystack).

So I want to limit the amount of complexity allowed in a single search (complexity here meaning the number of DFA states it produces, or the memory it consumes, I suppose). The reset count might not be perfectly well suited for that because it depends on the previous searches: it's basically imprecise by up to the configured reset limit, since earlier searches may already have consumed part of the budget.

The perfect behavior, I think, I'd only get if I reset my cache before every search. Then I could define my memory limit through the cache capacity alone.

(I'm sorry this got kinda rambly, but I'm not confident in my understanding of the internals, so I hope this at least makes clear what my goal as a user is.)
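Concretely, what I have in mind is something like the sketch below, assuming regex-automata 0.4. try_search_fwd, Cache::reset, and Cache::clear_count are the actual APIs as I understand them; the helper itself and the per-search policy are just my idea:

```rust
use regex_automata::{
    hybrid::dfa::{Cache, DFA},
    HalfMatch, Input, MatchError,
};

/// Hypothetical per-search variant: reset the cache before every search
/// so that the clear count only ever reflects the current haystack.
fn search_fresh(
    dfa: &DFA,
    cache: &mut Cache,
    haystack: &[u8],
) -> Result<Option<HalfMatch>, MatchError> {
    // Throws away all lazily compiled states (and, as I understand it,
    // the accumulated clear count), so nothing carries over from
    // previous searches.
    cache.reset(dfa);
    let result = dfa.try_search_fwd(cache, &Input::new(haystack));
    // cache.clear_count() now reports how often the cache filled up
    // during this one search, which is the per-search complexity
    // measure I'm after.
    result
}
```

The obvious downside is that every search starts cold, which is exactly the reconstruction cost I was hoping to avoid.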
I'm not sure what you mean by "cache count". I assume you mean the cache reset count?