Dealing with unicode character counts #1187
-
I'm working on a library that uses PyO3 to adapt the `regex` crate to Python's `re` interface, and I've run into a mismatch in how the two report match offsets:

```
>>> regexrs.match('.*', 'hello \N{EARTH GLOBE AMERICAS}')  # the rust implementation
<regexrs.Match object; span=(0, 10), match="hello 🌎">
>>> re.match('.*', 'hello \N{EARTH GLOBE AMERICAS}')  # python stdlib implementation
<re.Match object; span=(0, 7), match='hello 🌎'>
```

Python likes to count characters, while the `regex` crate reports byte offsets. Is there any reasonable way I can deal with this problem? Is there anything the regex crate could do to support such a use case? Somewhat related: #54

Based on this answer I've tried using the following when I create my match objects:
```diff
#[pyfunction]
fn r#match(
    py: Python,
    pattern: PyObject,
    string: String,
    flags: Option<i32>,
) -> PyResult<Option<Match>> {
    // ...
    return Ok(Some(Match {
        // ...
-       endpos: matched.end(),
+       endpos: matched.start() + matched.as_str().graphemes(true).count(),
        // ...
    }));
}
```

This seems to work, but is incomplete (the start offset needs the same treatment, for one).
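For what it's worth, here is a small standalone Rust sketch (not code from the library above) of where the 10 and the 7 come from, and why counting graphemes happens to give the right answer for this particular haystack:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "hello \u{1F30E}"; // "hello 🌎"

    // The regex crate reports offsets into the UTF-8 encoding:
    assert_eq!(s.len(), 10); // "hello " is 6 bytes, U+1F30E is 4 bytes
    // Python's re module reports codepoint offsets:
    assert_eq!(s.chars().count(), 7);
    // Grapheme clusters agree with codepoints for this input...
    assert_eq!(s.graphemes(true).count(), 7);

    // ...but not in general: "e" + U+0301 is two codepoints, one grapheme.
    let e = "e\u{301}";
    assert_eq!(e.chars().count(), 2);
    assert_eq!(e.graphemes(true).count(), 1);
}
```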
-
A somewhat more complete solution also needs to account for the fact that `\r\n` is counted as a single grapheme cluster. One possible solution is to provide a function for converting a grapheme index (as, say, passed by a Python user) to a byte index, usable with the `regex` crate's search APIs.
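A rough sketch of what such a conversion might look like, using the `unicode-segmentation` crate (the function name is made up for illustration):

```rust
use unicode_segmentation::UnicodeSegmentation;

/// Convert a grapheme index (as, say, passed by a Python user) into a
/// byte index usable with the regex crate. Returns None if out of range.
fn grapheme_to_byte_index(s: &str, grapheme_idx: usize) -> Option<usize> {
    let mut seen = 0;
    for (byte_idx, _) in s.grapheme_indices(true) {
        if seen == grapheme_idx {
            return Some(byte_idx);
        }
        seen += 1;
    }
    // An index one past the last grapheme maps to the end of the string.
    if seen == grapheme_idx { Some(s.len()) } else { None }
}
```

(Though, as the reply below argues, graphemes may not be the right unit to convert from in the first place.)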
-
Grapheme clusters are a total red herring here. They aren't relevant. Python doesn't use grapheme cluster offsets. It uses codepoint offsets. I also don't really understand the `\r\n` diversion. They are two distinct ASCII characters. Python strings will treat them as two distinct characters, just like Rust strings. The offsets will even be the same. Python might normalize line endings when doing I/O, but that shouldn't be a concern for interfacing the `regex` crate with Python strings.

To respond more holistically, it's important to get terminology correct here. The issue is not that this crate "counts Unicode characters differently" from Python. That would be very bad. The issue is that the offsets used by this crate and the offsets used by Python are not equivalent.

The issue is even deeper than that, because the correct choice for offsets is coupled with the representation and cost model exposed by the corresponding string data type. In Rust, strings are always UTF-8. The correct offsets to use for such strings are byte offsets, because they provide constant time substring slicing. In Python, strings are logically a sequence of codepoints. (The actual representation might be a sequence of bytes if the string is all ASCII, but this is consistent with "sequence of codepoints.") In this context, the correct offsets to use for such strings are codepoint offsets, because they provide constant time substring slicing. Notice though that codepoint offsets and character offsets are not necessarily the same, in part because "character" is a very ambiguous notion, and in part because the maximal interpretation of "character" is "grapheme cluster." And Python certainly does not use grapheme cluster offsets.

Another way of looking at it is that for any given string data type, the correct offsets to use are always code unit offsets. For UTF-8, code units are equivalent to the individual bytes that make up the UTF-8 encoding. For Python, its general representation is equivalent to UTF-32, and thus its code unit offsets are equivalent to codepoints. If a string data type is UTF-16, then code units are 16-bit unsigned integers, with codepoints in the basic multilingual plane being encoded with one code unit and all other codepoints being encoded with two code units. Indeed, if you use regexes in Java or C#, the offsets you get back are in terms of UTF-16 code units.

So, in summary, the root cause of the issue you're facing is very fundamental: it can be traced directly to the different representation choices for strings themselves. You really only have two ways of dealing with this:

1. Expose byte offsets in your Python API, and document that they index into the UTF-8 encoding of the string rather than into the Python string itself.
2. Compute a map between byte offsets and codepoint offsets, and use it to translate the spans reported by this crate before returning them to Python.
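To make the slicing point concrete, a standalone sketch (not code from the thread) showing that Python's span `(0, 7)` is not even a valid byte range into the UTF-8 encoding:

```rust
fn main() {
    let s = "hello \u{1F30E}"; // "hello 🌎"

    // Byte offsets index the UTF-8 encoding and slice in constant time:
    assert_eq!(&s[0..10], s);

    // Python's codepoint span (0, 7) is not a valid byte range here:
    // byte 7 falls inside the four-byte encoding of U+1F30E.
    assert!(s.get(0..7).is_none());

    // Recovering the codepoint offset for byte offset 10 requires a scan:
    assert_eq!(s[..10].chars().count(), 7);
}
```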
(2) is probably the most robust choice. The extra cost of computing the map is a bummer, but it's probably not large relative to the work you need to do anyway. In particular, in order to use this crate with a Python string, you already need to encode the string as UTF-8. So you're already doing a full pass over the string, and the map can be computed as part of it.

For more elaboration on this point, see: BurntSushi/aho-corasick#72

Finally, G-Research/ahocorasick_rs is a Python wrapper library for my `aho-corasick` crate, and it has had to deal with this exact problem; it may be worth looking at what it does.
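A minimal sketch of option (2), assuming the map is a dense `Vec` built in one pass over the UTF-8 bytes (the names here are illustrative, not any existing API):

```rust
use regex::Regex;

/// map[b] = the codepoint offset corresponding to byte offset b, for every
/// b in 0..=s.len() that lies on a codepoint boundary. Entries at offsets
/// inside a multi-byte codepoint are never consulted, since the regex
/// crate only reports offsets on codepoint boundaries.
fn byte_to_codepoint_map(s: &str) -> Vec<usize> {
    let mut map = Vec::with_capacity(s.len() + 1);
    let mut cp = 0;
    for b in s.bytes() {
        map.push(cp);
        // Count every byte that is not a UTF-8 continuation byte.
        if b & 0b1100_0000 != 0b1000_0000 {
            cp += 1;
        }
    }
    map.push(cp);
    map
}

fn main() {
    let hay = "hello \u{1F30E}"; // "hello 🌎"
    let map = byte_to_codepoint_map(hay);

    let m = Regex::new(".*").unwrap().find(hay).unwrap();
    // Byte span, as reported by the regex crate:
    assert_eq!((m.start(), m.end()), (0, 10));
    // Translated codepoint span, matching Python's re module:
    assert_eq!((map[m.start()], map[m.end()]), (0, 7));
}
```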