why does this crate and PCRE2 differ with respect to \w+|[^\w\s]+
when searching haystacks with Unicode data?
#1019
What version of regex are you using?
v1.8.4

Describe the bug at a high level.
Matching the string "戦場のヴァルキュリア3" with the pattern r"\w+|[^\w\s]+" gives 1 match.

What are the steps to reproduce the behavior?
Here is the Rust code I used:

What is the actual behavior?
The Rust code gives 1 match, but 2 matches looks right.

What is the expected behavior?
I expect 2 matches.
Replies: 1 comment
It looks like you asked this question on StackOverflow, and the answer you got was basically correct, with a few missing details added via the comments there.

The result you're getting is correct. The key thing you're likely missing is that this crate defaults to treating `\w` as Unicode-aware, whereas PCRE2 defaults to treating `\w` as ASCII-only. You can make PCRE2 treat `\w` as Unicode-aware (by enabling the `PCRE2_UCP` option), and similarly, you can make this crate treat `\w` as ASCII-only. For example, `(?-u:\w)` and `[\w&&\p{ascii}]` are precisely equivalent.

Making `[^\w\s]` ASCII-only is a little trickier though, since `(?-u:[^\w\s])` will match any individual byte that isn't in `\w` or `\s`. That in…