-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Unicode and Windows support #20
base: master
Are you sure you want to change the base?
Conversation
Thanks LEFTazs! Really appreciate your contribution. However, I do have a concern about breaking backwards compatibility, so this PR might still need some changes. |
The code is now backwards compatible. All tests run without error. |
@@ -1,10 +1,10 @@ | |||
#!python | |||
# cython: language_level=3, boundscheck=False, wraparound=False, embedsignature=True, linetrace=True, c_string_type=str, c_string_encoding=ascii | |||
# distutils: define_macros=CYTHON_TRACE_NOGIL=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was CYTHON_TRACE_NOGIL
removed?
@@ -132,11 +132,20 @@ cdef inline DTYPE_t row_insert_range_cost( | |||
|
|||
# End Array2D | |||
|
|||
cdef unsigned int* convert_string_to_int_array(unsigned char* str, Py_ssize_t size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this function can be nogil
if we get rid of enumerate
and ord
:
cdef unsigned int* intarr = <unsigned int*> malloc(size * sizeof(unsigned int))
for i in range(size):
intarr[i] = str[i]
return intarr
I don't think ord
is necessary since in C unsigned char
can be widened to unsigned int
implicitly without loss of precision.
|
||
cdef inline unsigned char str_1_get(unsigned char* s, Py_ssize_t i) nogil: | ||
cdef void copy_str_to_int_arr(unsigned char* str, Py_ssize_t len, unsigned int* int_arr) nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both this function and convert_string_to_int_array
?
@@ -179,24 +188,33 @@ def damerau_levenshtein( | |||
if transpose_costs is None: | |||
transpose_costs = unit_matrix | |||
|
|||
s1 = str(str1).encode() | |||
s2 = str(str2).encode() | |||
s1 = str1.encode('utf-8').decode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe add a comment explaining what this part is doing? There are a lot of subtleties in Cython string implicit conversion (and frankly I can't really tell what this is doing either without digging through a lot of documentation).
Also it would be nice to create a helper function for string conversion that can be shared between lev, dam_lev, and osa. Maybe we can document what this does in that helper function.
Perhaps the helper function can take in the unsigned char*
and return a tuple of (unsigned int*, Py_ssize_t)
that returns the int array and length. See here for how to return a tuple in Cython.
Thank you for your hard work LEFTazs! I left a couple more minor comments, but I think we have addressed all the major issues. I think this PR will be in a good place after one or two more iterations. Also just a heads up, I am not an owner of this repo, so I cannot approve this PR. However, I will try to reach out to the owners of this repo to get this merged, so that your hard work isn't wasted. |
Any updates? I'm interested in having Unicode support, too. |
This thread has been stagnant for some time now, however, I'd really appreciate seeing it merged! |
Yeah please! |
Note:
The ALPHABET_SIZE constant has been increased to 512 because of the added Unicode support.
This should be enough for most users. A better alternative could be making ALPHABET_SIZE choosable by the user.