Dealing with unicode character counts #1187
-
I'm working on a library that uses PyO3 to adapt the `regex` crate to Python's `re` interface, and I've run into a mismatch in how the two report match offsets:

```
>>> regexrs.match('.*', 'hello \N{EARTH GLOBE AMERICAS}')  # the rust implementation
<regexrs.Match object; span=(0, 10), match="hello 🌎">
>>> re.match('.*', 'hello \N{EARTH GLOBE AMERICAS}')  # python stdlib implementation
<re.Match object; span=(0, 7), match='hello 🌎'>
```

Python likes to count characters, while the `regex` crate reports byte offsets. Is there any reasonable way I can deal with this problem? Is there anything the regex crate could do to support such a use case? Somewhat related: #54

Based on this answer I've tried using the following when I create my match objects:
```diff
#[pyfunction]
fn r#match(
    py: Python,
    pattern: PyObject,
    string: String,
    flags: Option<i32>,
) -> PyResult<Option<Match>> {
    // ...
    return Ok(Some(Match {
        // ...
-       endpos: matched.end(),
+       endpos: matched.start() + matched.as_str().graphemes(true).count(),
        // ...
    }));
}
```

This seems to work, but is incomplete (the start offset needs the same treatment, for one).
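For what it's worth, here is a small standalone Rust sketch (not code from the library above) of where the 10 and the 7 come from, and why counting graphemes happens to give the right answer for this particular haystack:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "hello \u{1F30E}"; // "hello 🌎"

    // The regex crate reports offsets into the UTF-8 encoding:
    assert_eq!(s.len(), 10); // "hello " is 6 bytes, U+1F30E is 4 bytes
    // Python's re module reports codepoint offsets:
    assert_eq!(s.chars().count(), 7);
    // Grapheme clusters agree with codepoints for this input...
    assert_eq!(s.graphemes(true).count(), 7);

    // ...but not in general: "e" + U+0301 is two codepoints, one grapheme.
    let e = "e\u{301}";
    assert_eq!(e.chars().count(), 2);
    assert_eq!(e.graphemes(true).count(), 1);
}
```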
-
A somewhat more complete solution also needs to account for the fact that `\r\n` is counted as a single grapheme cluster. One possible solution is to provide a function for converting a grapheme index (as, say, passed by a Python user) to a byte index, usable with the `regex` crate's search APIs.
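A rough sketch of what such a conversion might look like, using the `unicode-segmentation` crate (the function name is made up for illustration):

```rust
use unicode_segmentation::UnicodeSegmentation;

/// Convert a grapheme index (as, say, passed by a Python user) into a
/// byte index usable with the regex crate. Returns None if out of range.
fn grapheme_to_byte_index(s: &str, grapheme_idx: usize) -> Option<usize> {
    let mut seen = 0;
    for (byte_idx, _) in s.grapheme_indices(true) {
        if seen == grapheme_idx {
            return Some(byte_idx);
        }
        seen += 1;
    }
    // An index one past the last grapheme maps to the end of the string.
    if seen == grapheme_idx { Some(s.len()) } else { None }
}
```

(Though, as the reply below argues, graphemes may not be the right unit to convert from in the first place.)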
-
Grapheme clusters are a total red herring here. They aren't relevant. Python doesn't use grapheme cluster offsets. It uses codepoint offsets. I also don't really understand the `\r\n` diversion. They are two distinct ASCII characters. Python strings will treat them as two distinct characters, just like Rust strings. The offsets will even be the same. Python might normalize line endings when doing I/O, but that shouldn't be a concern for interfacing the `regex` crate with Python strings.

To respond more holistically, it's important to get terminology correct here. The issue is not that this crate "counts Unicode characters differently" from Python. That would be very bad. The issue is that the offsets used by this crate and the offsets used by Python are not equivalent.

The issue is even deeper than that, because the correct choice for offsets is coupled with the representation and cost model exposed by the corresponding string data type. In Rust, strings are always UTF-8. The correct offsets to use for such strings are byte offsets, because they provide constant time substring slicing. In Python, strings are logically a sequence of codepoints. (The actual representation might be a sequence of bytes if the string is all ASCII, but this is consistent with "sequence of codepoints.") In this context, the correct offsets to use for such strings are codepoint offsets, because they provide constant time substring slicing. Notice though that codepoint offsets and character offsets are not necessarily the same, in part because "character" is a very ambiguous notion, and in part because the maximal interpretation of "character" is "grapheme cluster." And Python certainly does not use grapheme cluster offsets.

Another way of looking at it is that for any given string data type, the correct offsets to use are always code unit offsets. For UTF-8, code units are equivalent to the individual bytes that make up the UTF-8 encoding. For Python, its general representation is equivalent to UTF-32, and thus its code unit offsets are equivalent to codepoints. If a string data type is UTF-16, then code units are 16-bit unsigned integers, with codepoints in the basic multilingual plane being encoded with one code unit and all other codepoints being encoded with two code units. Indeed, if you use regexes in Java or C#, the offsets you get back are in terms of UTF-16 code units.

So, in summary, the root cause of the issue you're facing is very fundamental: it can be traced directly to the different representation choices for strings themselves. You really only have two ways of dealing with this:

1. Expose byte offsets in your Python API, and document that they index into the UTF-8 encoding of the string rather than into the Python string itself.
2. Compute a map between byte offsets and codepoint offsets, and use it to translate the spans reported by this crate before returning them to Python.
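To make the slicing point concrete, a standalone sketch (not code from the thread) showing that Python's span `(0, 7)` is not even a valid byte range into the UTF-8 encoding:

```rust
fn main() {
    let s = "hello \u{1F30E}"; // "hello 🌎"

    // Byte offsets index the UTF-8 encoding and slice in constant time:
    assert_eq!(&s[0..10], s);

    // Python's codepoint span (0, 7) is not a valid byte range here:
    // byte 7 falls inside the four-byte encoding of U+1F30E.
    assert!(s.get(0..7).is_none());

    // Recovering the codepoint offset for byte offset 10 requires a scan:
    assert_eq!(s[..10].chars().count(), 7);
}
```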
(2) is probably the most robust choice. The extra cost of computing the map is a bummer, but it's probably not large relative to the work you need to do anyway. In particular, in order to use this crate with a Python string, you already need to encode the string as UTF-8. So you're already doing a full pass over the string, and the map can be computed as part of it.

For more elaboration on this point, see: BurntSushi/aho-corasick#72

Finally, G-Research/ahocorasick_rs is a Python wrapper library for my `aho-corasick` crate, and it has had to deal with this exact problem; it may be worth looking at what it does.
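A minimal sketch of option (2), assuming the map is a dense `Vec` built in one pass over the UTF-8 bytes (the names here are illustrative, not any existing API):

```rust
use regex::Regex;

/// map[b] = the codepoint offset corresponding to byte offset b, for every
/// b in 0..=s.len() that lies on a codepoint boundary. Entries at offsets
/// inside a multi-byte codepoint are never consulted, since the regex
/// crate only reports offsets on codepoint boundaries.
fn byte_to_codepoint_map(s: &str) -> Vec<usize> {
    let mut map = Vec::with_capacity(s.len() + 1);
    let mut cp = 0;
    for b in s.bytes() {
        map.push(cp);
        // Count every byte that is not a UTF-8 continuation byte.
        if b & 0b1100_0000 != 0b1000_0000 {
            cp += 1;
        }
    }
    map.push(cp);
    map
}

fn main() {
    let hay = "hello \u{1F30E}"; // "hello 🌎"
    let map = byte_to_codepoint_map(hay);

    let m = Regex::new(".*").unwrap().find(hay).unwrap();
    // Byte span, as reported by the regex crate:
    assert_eq!((m.start(), m.end()), (0, 10));
    // Translated codepoint span, matching Python's re module:
    assert_eq!((map[m.start()], map[m.end()]), (0, 7));
}
```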