Is the regex crate a bottleneck in your program? If so, can you share the details? #960
Replies: 8 comments 25 replies
-
Is pattern disclosure via an email ([email protected]) an option? I am aware that the regexes would end up in public resources in the end, but I'd be keen to discuss their use case outside of a public eye. For my use case I care mostly about the performance of |
Beta Was this translation helpful? Give feedback.
-
This comes with a big caveat that this is not the regex I wanted to write, but the regex that runs the fastest today when using the I ran less into the current implementation being slow (they're fast enough to make memory bandwidth the bottleneck) and more into the crate not exposing the APIs I need to avoid multiple scans over the input. I need to account for both IIRC I tried using The code and detailed notes about performance can be found here. |
Beta Was this translation helpful? Give feedback.
-
I use The regex is here and if you're interested I can provide some arbitrarily large haystacks |
Beta Was this translation helpful? Give feedback.
-
We have run into issues with the size of the compiled regexes. For example |
Beta Was this translation helpful? Give feedback.
-
Please don’t judge me for this monstrosity (😅), but here is one I use to find potential AWS keys in code:
the comment describes the thought process:
I would use captures if I wasn’t outsourcing this to ripgrep. I ran this over every bit of code published to pypi and rubygems |
Beta Was this translation helpful? Give feedback.
-
Hello, I believe I have some insights that may be helpful for you, since I've spend a lot of time tuning high-performance regex parsing application some months ago. Brief background & motivation:In our company, we produce large volume of unstructured logs in proprietary format and I often had to do adhoc analysis on files that are around 4GB.
something like
which lends well to further analysis. As Ripgrep is an excellent tool that can process these multi-GB files in low seconds and regex crate also has great reputation overall, I've got inspired to try rewriting and generalizing this idea as Rust CLI app transforming stream of messages with arbitrary log format specified mainly by hierarchy of regexes into stream of JSON messages. I've managed to do it, but the result wasn't as fast as I've hoped, not enough to casually transform whole files near-interactively, so I had to stick with per-case prefiltering just like before, ultimately making the tool not worth finishing, for now at least. As the regexes are user facing and often adhoc hand tuned for individual servers for handling their own unique message kinds, the intention was to keep them highlevel and not overoptimized - on the other hand, I've put lot of effort into optimizing the program itself, likely even overengineed that a fair bit, so the resulting performance is mostly (75% according to profiling) spent in the regexes themselves. The I/O cost was actually mostly negligible. Overall I've managed to crunch reference 2.2GB file in around 46 seconds from cold start, which isn't terrible, but in real world I had to wait around 3 minutes for some bigger ones and with, say, 8 logfiles, this wasn't what I hoped for. TakeawaysI've played with mmap and madvise and at least on machines I've been using (Debian Stretch & Buster, x86_64) there wasn't any notable difference from conventional Readers - I've read your post somewhere with similar findings in Ripgrep. I've had fancy-regex as optional conditional compilation feature - they are built on top of regex crate, so one would expect that if you don't use advanced features like backtracking, the regexes may be as performant as raw regex crate - unfortunately, it doesn't seem to be the case and there was small, but noticeable perf hit. Profile guided optimization consistently helped gain around 5-10% speed increase. Original version was written on top of unicode strings. As the regexes I've used needed only ascii matching, I've tried moving to bytewise (regex::bytes) regexes, but that instead lead to performance hit. Even after adding The most notable optimization (from around 4.5 minutes to sub 1 minute) I got from parallelization, having worker threads working on lines independently with number of workers matching number of logical cores (so around 12). I've read your post about the internal scratch spaces in Regex and tradeoffs between API simplicity and optimality - and can confirm that difference between sharing the regexes between threads and cloning them was negligible. Most of the processing time was spent in handling the primary regex, so we'll only consider that one.
In the rust regex syntax the final version I was using looked like this
ConclusionsI wondered why is grepping with ripgrep so much more performant then parsing I did and experimented with various simplified grammars - overall, grep-like usecases that relied on finding string literal (prefix/suffix) were exceptionally fast, while I did try the regex DFA CLI debugging tool and some generated machines were suboptimal or at least far more complex than casual regex user expects, consider following simple examples
But even with optimized state machine generation, I somewhat doubt it'd make much difference unless there's a possibility for large performance gain in how loops are handled. I'm not familiar with regex internals, but profiling showed that vast majority of the time was spent in Overall, I feel it's likely that getting notably better performance would require move from regexes to other parsing solution, but that would effectively kill this project, as easy configuration of the parsing was a key selling point. In any case, I hope this writeup was helpful and I'm glad I had a change to talk a bit about this project. |
Beta Was this translation helpful? Give feedback.
-
Awesome, glad I could help and thanks for additional clarifications.
Right, I recall now reading that in the past, so it's definitely discoverable. If I'll ever come across some info that would be a good fit there, I'll make an issue/PR. The pcre2/jit results look very amazing indeed and picked my curiosity enough to try quickly plugging it in my application (using your |
Beta Was this translation helpful? Give feedback.
-
I wanted to share a case where replacing a regex with some hand-crafted string searching proved to be a nearly ~2x win. It's not necessarily exactly the same result (didn't fuzz etc) but it's functionally equivalent. The program is rustfilt, which finds and demangles Rust symbols in arbitrary text (and prints that text to stdout/file). Here is the regex in question: https://github.com/luser/rustfilt/blob/3c81f107b73fdcf7bbb98b426c1214f15946ec90/src/main.rs#L36: Here is the PRs (rust-lang/rustc-demangle#62, rust-lang/rustc-demangle#63) that added code to rustc-demangle which is in my measurements ~2x faster end-to-end, essentially replacing the regex crate's replace_all with a hand-coded routine which accomplishes the same result. As far as I can tell, the speedup is pretty much entirely down to avoiding some sort of backtracking in I've attached (rustc.script.zst.zip) a zstd-compressed (and then zip'd, so GitHub accepts it) ~980MB file (can likely be cut down, should be pretty uniform) that can be used to reproduce this. The file is the result of
The rustfilt diff to produce the faster version can be found here: luser/rustfilt#21, in general it's just calling into the rustc_demangle::demangle_stream function though. Let me know if this is helpful / I can provide more sample data, etc. (Or if you find bugs in the replacement for regex for this scenario :) |
Beta Was this translation helpful? Give feedback.
-
I am working on a regex benchmark (to be made public probably some time this year), and I'm looking to add regexes to it that are used in the real world. The benchmark will of course include regexes that I've conjured to test certain properties of regex implementations, but I would also like to balance that with real world regexes. I am especially interested in regexes that are used in performance sensitive aspects of your application. It doesn't necessarily have to be the number 1 bottleneck, but ideally it's impacting your bottom line performance in some way.
If you're willing to share, I would love to know both the regex and an example haystack that is searched. I would also like to know which regex APIs you use, for example,
is_match
,find
orcaptures
. Finally, if you could briefly explain what the regex is doing conceptually that would be awesome.Note: many folks are responding which is great. But many are describing the bottleneck they hit instead of showing it. What I'm really asking for here is the actual regex, a (possibly sanitized) haystack, and the API you're using. An ideal submission is a short program that you agree recapitulates the essential performance problem you're facing. The haystack doesn't need to be big. 30MB should be more than enough for example.
Beta Was this translation helpful? Give feedback.
All reactions