Define consistent upper bound for hybrid DFA #1089
-
I'm currently rewriting my streaming regex search from a dense DFA to a hybrid DFA, since a dense DFA can lead to excessive memory usage with certain patterns. On the memory side, I can set an NFA size limit and a cache limit and I never run into any memory problems, so that's all good (see the sketch at the end of this post for roughly what I mean).

But on the runtime side I'm not entirely sure what the "best practice" would be. Of course I could use a timer to determine how long a search has taken, but I was hoping to avoid that by limiting the number of cache resets instead. This works pretty well: "complex" regexes that take a long time to search reliably exceed the reset count, while simpler regexes searching a haystack of the same size still work.

However, the reset count is persistent, so that leaves me only two options: either I reset it before every search and lose the ability to cache the compiled regex to avoid reconstruction across multiple searches, or I reset it on error and users will inevitably run into this error after using any regex that resets the cache more than once.

So I guess this boils down to two questions:
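For reference, the memory-side setup I mean looks roughly like the sketch below. This assumes regex-automata 0.4; the helper name and the concrete limits are placeholders of my own, not anything prescribed by the library.

```rust
use regex_automata::{hybrid::dfa::DFA, nfa::thompson};

/// Build a lazy DFA whose memory use is bounded no matter the pattern:
/// the NFA size and the lazy DFA's state cache are both capped.
/// (Hypothetical helper; the limits below are placeholder values.)
fn bounded_dfa(pattern: &str) -> Result<DFA, Box<dyn std::error::Error>> {
    let dfa = DFA::builder()
        // Cap the lazy DFA's transition table cache at 10 MiB.
        .configure(DFA::config().cache_capacity(10 * (1 << 20)))
        // Refuse to compile patterns whose NFA exceeds ~1 MiB.
        .thompson(thompson::Config::new().nfa_size_limit(Some(1 << 20)))
        .build(pattern)?;
    Ok(dfa)
}
```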
-
It might help to look at the configuration that a meta::Regex uses: regex/regex-automata/src/meta/wrappers.rs, lines 581 to 599 at 061ee81.

Basically, it uses a minimum cache clear count and a minimum number of bytes processed per state. The second knob is important because a cache clearing on its own doesn't mean the search will be slow: if the cache is only cleared occasionally, then the lazy DFA is probably still faster than the PikeVM (for example). Of course, it's just a heuristic, so it can be wrong. But in my experience, the combination of a minimum cache clear count and a minimum bytes per state is usually better in practice than either one on its own.

With that out of the way, I'm having a little trouble parsing your question. Why do you want to reset the count to zero at the start of every search? The count is persistent because of an assumption that if the lazy DFA is slow on one haystack, then it will likely be slow on the next haystack. I'm open to exposing a method just for clearing the cache clear count and nothing else. Would that solve your problem? I just want to make sure I understand the use case so that I can tell whether the method is well motivated.
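In code, setting those two knobs on a standalone hybrid DFA looks roughly like the following. This is a sketch rather than the literal wrappers.rs configuration, and the specific values are illustrative:

```rust
use regex_automata::hybrid::dfa::DFA;

/// Build a lazy DFA that gives up when it stops making progress: once
/// the cache has been cleared at least 3 times AND the search has
/// averaged fewer than 10 bytes of haystack per cached state, searches
/// fail with a "gave up" MatchError instead of thrashing the cache.
/// (Hypothetical helper; the values are illustrative.)
fn heuristic_dfa(pattern: &str) -> Result<DFA, Box<dyn std::error::Error>> {
    let dfa = DFA::builder()
        .configure(
            DFA::config()
                .minimum_cache_clear_count(Some(3))
                .minimum_bytes_per_state(Some(10)),
        )
        .build(pattern)?;
    Ok(dfa)
}
```

The point of the second knob is exactly the one above: clearing alone is fine as long as each filled cache still buys a lot of scanned bytes.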
-
I've read this in the documentation, but that configuration is specifically concerned with efficiency, right? Its purpose is to detect when to switch to a different regex implementation. My goal, though, is to determine when I should stop the search entirely.
The problem in my case is the complexity of the original regex, not the particular haystack. But maybe I'm misunderstanding the way the cache resets work. I'd imagine that with a limited memory size (which I have), certain patterns are inevitably going to run into cache resets, and a single cache reset isn't necessarily going to mean that I should give up already. And if I run into a cache reset once, running the same search again will produce another reset (since the entire DFA cannot be cached for this haystack).

So I want to limit the amount of complexity allowed in a single search (complexity here meaning the number of DFA states it produces, or the memory it consumes, I suppose). The reset count might not be perfectly well suited for that because it depends on the previous searches: it's basically imprecise by up to the configured reset limit, since earlier searches may already have consumed part of the budget.

The perfect behavior, I think, I'd only get if I reset my cache before every search. Then I could define my memory limit through the cache capacity alone.

(I'm sorry this got kinda rambly, but I'm not confident in my understanding of the internals, so I hope this at least makes clear what my goal as a user is.)
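Concretely, what I have in mind is something like the sketch below, assuming regex-automata 0.4. try_search_fwd, Cache::reset, and Cache::clear_count are the actual APIs as I understand them; the helper itself and the per-search policy are just my idea:

```rust
use regex_automata::{
    hybrid::dfa::{Cache, DFA},
    HalfMatch, Input, MatchError,
};

/// Hypothetical per-search variant: reset the cache before every search
/// so that the clear count only ever reflects the current haystack.
fn search_fresh(
    dfa: &DFA,
    cache: &mut Cache,
    haystack: &[u8],
) -> Result<Option<HalfMatch>, MatchError> {
    // Throws away all lazily compiled states (and, as I understand it,
    // the accumulated clear count), so nothing carries over from
    // previous searches.
    cache.reset(dfa);
    let result = dfa.try_search_fwd(cache, &Input::new(haystack));
    // cache.clear_count() now reports how often the cache filled up
    // during this one search, which is the per-search complexity
    // measure I'm after.
    result
}
```

The obvious downside is that every search starts cold, which is exactly the reconstruction cost I was hoping to avoid.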
I'm not sure what you mean by "cache count". I assume you mean the cache reset count?