why is Python's behavior with empty matches inconsistent with the regex crate's handling of empty matches? #1164
What version of regex are you using?
Describe the bug at a high level.
What are the steps to reproduce the behavior?
Rust code:

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"x*").unwrap();
        let hay = "abxd";
        // Replace every match of `x*` (including empty ones) with "-".
        println!("{:?}", re.replace_all(hay, "-"));
    }

Equivalent Python code:

    import re

    regex = r"x*"
    test_str = "abxd"
    subst = "-"
    # count=0 means "replace all occurrences".
    result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
    if result:
        print(result)

What is the actual behavior?

The Rust program prints `"-a-b-d-"`, while the Python program prints `-a-b--d-`.

What is the expected behavior?

Both empty strings should be replaced, resulting in `-a-b--d-`.

By the way, I am not sure whether this is an intentional difference or a potential bug.
Replies: 1 comment 1 reply
Note that the pattern in your prose does not match the pattern in your programs. This was confusing to me. I've edited your prose to match your code. I've also converted this to a discussion because this isn't a bug, but is a good question.

But no, this is correct behavior. If anything, it's Python's behavior that is somewhat confounding here, because Python's behavior implies it is using an approach that reports overlapping matches, despite the fact that Python's documentation for `re.sub` (and `re.findall` and `re.finditer`) claims that it reports non-overlapping matches. More generally, there is no specific goal for this crate to match the behavior of any other regex engine. So the existence…

Since both replacement routines are built on iterating over matches, consider a Rust program that prints the span of each match:

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"x*").unwrap();
        let hay = "abxd";
        for m in re.find_iter(hay) {
            println!("{:?}", (m.start(), m.end()));
        }
    }

And a Python program that does the same:

    import re

    regex = r"x*"
    test_str = "abxd"
    for m in re.finditer(regex, test_str):
        print(m.span())

And now the Rust program's output:

    (0, 0)
    (1, 1)
    (2, 3)
    (4, 4)

And Python:

    (0, 0)
    (1, 1)
    (2, 3)
    (3, 3)
    (4, 4)

This crate very specifically and very intentionally does not report the `(3, 3)` match.

The reason why regex engines differ here is that empty matches are somewhat of a strange beast. Namely, if you don't handle them specially during iteration over matches, an empty match won't advance the start position of the next search, and the iteration would loop forever. Therefore, if your regex engine provides iteration over all matches (not all do, but most do), then you must choose how to handle empty matches. This crate does the thing that most other regex engines do: it treats non-empty matches immediately before or immediately after an empty match as overlapping, and thus does not permit those non-empty matches to be reported. Python's regex engine, however, allows such matches.

Fun fact: before Python 3.7, `re.sub` produced the same result as this crate does here.
If you read the above issues, there is a claim that Python is now more consistent with other engines, but as far as I can tell, this isn't generally true. While it does seem that Python now matches Perl's behavior, it specifically does not match the behavior of at least the following engines: .NET, D, Go, ICU, Java, JavaScript (v8 and regress), RE2 and, of course, this crate. There is also PCRE2 to consider, but PCRE2 doesn't really define an iteration protocol, so it's hard to say whether Python is "consistent" with it or not.
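
To make the iteration rule discussed above concrete, here is a minimal Python sketch. It is not this crate's actual implementation, only an illustration under one assumption: an empty match that ends exactly where the previously reported match ended is treated as overlapping and dropped. The helper names `spans_dropping_adjacent_empty` and `replace_all_like` are hypothetical and exist only for this example.

```python
import re


def spans_dropping_adjacent_empty(pattern, hay):
    """Collect match spans, dropping any empty match that touches the end
    of the previously reported match (hypothetical helper, illustration only)."""
    rx = re.compile(pattern)
    spans = []
    pos = 0
    last_end = None
    while pos <= len(hay):
        m = rx.search(hay, pos)
        if m is None:
            break
        start, end = m.span()
        if start == end:
            # An empty match does not move the search position by itself,
            # so step forward one character to avoid looping forever.
            pos = end + 1
            if end == last_end:
                # Touches the previous match: treat as overlapping, skip it.
                continue
        else:
            pos = end
        spans.append((start, end))
        last_end = end
    return spans


def replace_all_like(pattern, hay, subst):
    """Replace every span reported by the helper above with `subst`."""
    out = []
    prev = 0
    for start, end in spans_dropping_adjacent_empty(pattern, hay):
        out.append(hay[prev:start])
        out.append(subst)
        prev = end
    out.append(hay[prev:])
    return "".join(out)


print(spans_dropping_adjacent_empty(r"x*", "abxd"))
print([m.span() for m in re.finditer(r"x*", "abxd")])
print(replace_all_like(r"x*", "abxd", "-"))
print(re.sub(r"x*", "-", "abxd"))
```

On Python 3.7 or later, the first and third lines printed should correspond to the Rust results (four spans and `-a-b-d-`), while the second and fourth show Python's own behavior (five spans, including `(3, 3)`, and `-a-b--d-`).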