why does `\b` differ in behavior from other regex engines such as Swift's? #1092
In the regex crate, `can't` is split into separate pieces:

```rust
use regex::Regex;

fn main() {
    let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";
    let re = Regex::new(r"\b").unwrap();
    let res = re.split(s).collect::<Vec<&str>>();
    // ["", "The", " ", "quick", " (\"", "brown", "\") ", "fox", " ", "can", "'", "t", " ", "jump", " ", "32", ".", "3", " ", "feet", ", ", "right", "?"]
    println!("{res:?}");
}
```

In Swift 5.7:

```swift
let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?"
let words = s.split(separator: /\b/)
// ["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", "fox", " ", "can\'t", " ", "jump", " ", "32.3", " ", "feet", ",", " ", "right", "?"]
print(words)
```

In the unicode-segmentation crate, the result is the same as the regex in Swift, following the [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/) rules:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "The quick (\"brown\") fox can't jump 32.3 feet, right?";
    let res = s.split_word_bounds().collect::<Vec<&str>>();
    // ["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet", ",", " ", "right", "?"]
    println!("{res:?}");
}
```
Replies: 1 comment 1 reply
It's unclear exactly what you're asking, but I took a stab at it by renaming the issue title to the question I think you're asking.

This crate defines `\b` as a Unicode word boundary. It is documented in the section on syntax. In short, it matches any position where there is a `\w` on one side and `\W` (or the beginning/end of a string) on the other. This is further documented in UNICODE.md, which specifically links to UTS #18 RL1.4. So in short, this crate defines `\b` as a "simple Unicode word boundary" according to the Unicode Technical Standard on regular expressions.

As for Swift, I went to its documentation page for regex and... found absolutely nothing about its supported syntax. Looking a little closer, and after some digging into Swift's regex design documents, I found that it is specifically documented to use UAX #29 word segmentation as its definition for `\b`. (To see this, you'll need to expand the "details" at the end of the Introduction section.)

Swift's Unicode support in its regex engine is impressive. Basically, this library implements level 1 support of UTS #18.

So, bottom line: this crate's `\b` is the simple word boundary from UTS #18 RL1.4, while Swift's `\b` follows UAX #29 word segmentation, which is also what the unicode-segmentation crate implements.
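To make the "simple word boundary" rule concrete, here is a minimal std-only sketch of it. The names `is_word_char` and `split_simple_word_bounds` are hypothetical helpers, not part of the regex crate, and `is_word_char` only approximates `\w` as "alphanumeric or underscore" (the crate's real `\w` covers somewhat more). Unlike `Regex::split`, this sketch does not emit an empty leading piece for the boundary at the start of the string.

```rust
/// Rough stand-in for `\w`: alphanumeric or underscore.
/// Assumption: this is narrower than the regex crate's actual `\w` class.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Splits `s` at every position where one neighbor is a word character
/// and the other is not, i.e. at each interior simple word boundary.
fn split_simple_word_bounds(s: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    let mut prev_is_word: Option<bool> = None;
    for (i, c) in s.char_indices() {
        let cur_is_word = is_word_char(c);
        if let Some(prev) = prev_is_word {
            // A boundary sits between two chars whose word-ness differs.
            if prev != cur_is_word {
                pieces.push(&s[start..i]);
                start = i;
            }
        }
        prev_is_word = Some(cur_is_word);
    }
    // Flush the final piece, if any.
    if start < s.len() {
        pieces.push(&s[start..]);
    }
    pieces
}

fn main() {
    let pieces = split_simple_word_bounds("fox can't jump 32.3 feet");
    // ["fox", " ", "can", "'", "t", " ", "jump", " ", "32", ".", "3", " ", "feet"]
    println!("{pieces:?}");
}
```

Because `'` and `.` are not word characters, `can't` and `32.3` each break into three pieces under this rule. UAX #29 keeps them whole because it has dedicated rules for apostrophes between letters and separators between digits.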