Add Unicode and Windows support #20

LEFTazs · 2020-02-03T21:50:01Z

The Damerau-Levenshtein method no longer crashes on Windows: fixes Jupyter notebook crashes when using dam_lev with transpose_costs #16
Unicode characters no longer cause a crash: fixes Spezial characters #13
Added test cases for Unicode support

Note:

The ALPHABET_SIZE constant has been increased to 512 because of the added Unicode support.
This should be enough for most users. A better alternative could be to make ALPHABET_SIZE configurable by the user.
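
To make the limit concrete, here is a minimal Python sketch. It assumes that ALPHABET_SIZE acts as an upper bound on the Unicode code point of every input character; the note above does not spell out how the constant is used, so treat that as an assumption. `fits_alphabet` is a hypothetical helper for illustration, not part of the library.

```python
ALPHABET_SIZE = 512  # value quoted in the note above


def fits_alphabet(text: str, alphabet_size: int = ALPHABET_SIZE) -> bool:
    """Return True if every character's code point is below alphabet_size.

    Assumption: the alphabet size bounds the code points that can be used
    as indices into the cost arrays.
    """
    return all(ord(ch) < alphabet_size for ch in text)


# Latin-1 letters such as U+00EF and U+00E9 have code points below 256.
print(fits_alphabet("na\u00efve caf\u00e9"))  # True
# Greek capital epsilon is U+0395 (917), which is above 511.
print(fits_alphabet("\u0395"))  # False
```

Under this assumption, 512 would cover Latin-1 and Latin Extended-A text, while Greek, Cyrillic, and CJK input would fall outside the range, which is one reason a user-configurable size might be worth considering.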

weighted_levenshtein/clev.pyx

weighted_levenshtein/clev.pxd

dsu1995 · 2020-02-04T04:48:23Z

Thanks LEFTazs! Really appreciate your contribution. However, I do have a concern about breaking backwards compatibility, so this PR might still need some changes.

weighted_levenshtein/clev.pyx

LEFTazs · 2020-02-07T21:24:52Z

The code is now backwards compatible. All tests run without error.

dsu1995 · 2020-02-09T06:50:07Z

weighted_levenshtein/clev.pyx

@@ -1,10 +1,10 @@
 #!python
-# cython: language_level=3, boundscheck=False, wraparound=False, embedsignature=True, linetrace=True, c_string_type=str, c_string_encoding=ascii
-# distutils: define_macros=CYTHON_TRACE_NOGIL=1


Why was CYTHON_TRACE_NOGIL removed?

dsu1995 · 2020-02-09T06:55:28Z

weighted_levenshtein/clev.pyx

@@ -132,11 +132,20 @@ cdef inline DTYPE_t row_insert_range_cost(

 # End Array2D

+cdef unsigned int* convert_string_to_int_array(unsigned char* str, Py_ssize_t size):


I believe this function can be nogil if we get rid of enumerate and ord:

    cdef unsigned int* intarr = <unsigned int*> malloc(size * sizeof(unsigned int))
    for i in range(size):
        intarr[i] = str[i]
    return intarr

I don't think ord is necessary since in C unsigned char can be widened to unsigned int implicitly without loss of precision.
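
As a rough Python analogue of this point (not the Cython code under review), iterating over a `bytes` object already yields integers in the range 0..255, so no `ord` call is needed for byte input. The sketch also shows a related caveat: widening UTF-8 bytes gives byte values, not code points, so the two approaches differ for non-ASCII text. The function names are hypothetical.

```python
def bytes_to_ints(data: bytes) -> list:
    """Widen each byte to an int; bytes iteration already yields ints 0..255."""
    return list(data)


def code_points(text: str) -> list:
    """Map each character of a str to its Unicode code point."""
    return [ord(ch) for ch in text]


encoded = "\u00e9".encode("utf-8")  # U+00E9 is a two-byte sequence in UTF-8
print(bytes_to_ints(encoded))   # [195, 169]
print(code_points("\u00e9"))    # [233]
```

For pure ASCII input both functions return the same list, which is why the distinction only shows up once Unicode support is involved.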

dsu1995 · 2020-02-09T07:03:36Z

weighted_levenshtein/clev.pyx


-cdef inline unsigned char str_1_get(unsigned char* s, Py_ssize_t i) nogil:
+cdef void copy_str_to_int_arr(unsigned char* str, Py_ssize_t len, unsigned int* int_arr) nogil:


Why do we need both this function and convert_string_to_int_array?

dsu1995 · 2020-02-09T07:13:09Z

weighted_levenshtein/clev.pyx

@@ -179,24 +188,33 @@ def damerau_levenshtein(
    if transpose_costs is None:
        transpose_costs = unit_matrix

-    s1 = str(str1).encode()  
-    s2 = str(str2).encode()  
+    s1 = str1.encode('utf-8').decode('utf-8')


Could we maybe add a comment explaining what this part is doing? There are a lot of subtleties in Cython string implicit conversion (and frankly I can't really tell what this is doing either without digging through a lot of documentation).

Also it would be nice to create a helper function for string conversion that can be shared between lev, dam_lev, and osa. Maybe we can document what this does in that helper function.

Perhaps the helper function can take in the unsigned char* and return a tuple of (unsigned int*, Py_ssize_t) that returns the int array and length. See here for how to return a tuple in Cython.
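
A minimal Python sketch of what such a shared helper could look like, assuming the goal is to turn a string into an array of code points plus its length. `to_int_array` is a hypothetical name, and this says nothing about how Cython's implicit string conversion treats the same expression. In plain Python 3, `text.encode('utf-8').decode('utf-8')` returns a string equal to the input and raises `UnicodeEncodeError` for lone surrogates, so it mainly acts as a validity check.

```python
from typing import List, Tuple


def to_int_array(text: str) -> Tuple[List[int], int]:
    """Return (code points, length) for text.

    Hypothetical helper: the UTF-8 round trip leaves valid text unchanged
    and rejects strings that cannot be encoded (e.g. lone surrogates).
    """
    validated = text.encode("utf-8").decode("utf-8")
    ints = [ord(ch) for ch in validated]
    return ints, len(ints)


ints, size = to_int_array("a\u00f1b")
print(ints, size)  # [97, 241, 98] 3
```

Keeping the conversion in one documented function would let lev, dam_lev, and osa share the same behavior and give the explanatory comment a single home.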

dsu1995 · 2020-02-09T07:30:52Z

Thank you for your hard work LEFTazs! I left a couple more minor comments, but I think we have addressed all the major issues. I think this PR will be in a good place after one or two more iterations.

Also just a heads up, I am not an owner of this repo, so I cannot approve this PR. However, I will try to reach out to the owners of this repo to get this merged, so that your hard work isn't wasted.

pachewise · 2020-12-11T00:45:11Z

@LEFTazs, any chance you can make the changes requested? I can help merge this once the code is ready, but honestly I'm not as well versed in this code, so @dsu1995, it would be great if you could also review after the changes are made.

emanuelevivoli · 2021-12-20T14:16:00Z

Any updates? I'm interested in having Unicode support, too.

shotofcovfefe · 2022-05-25T23:09:01Z

This thread has been stagnant for some time now; however, I'd really appreciate seeing it merged!

PinguDEV-original · 2024-06-05T19:39:46Z

Yeah please!

LEFTazs added 2 commits February 3, 2020 17:43

Add unicode character support for levenshtein and osa

6272927

Add Windows and unicode support for damerau

9d64fb8

dsu1995 reviewed Feb 4, 2020
