Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Unicode and Windows support #20

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

LEFTazs
Copy link

@LEFTazs LEFTazs commented Feb 3, 2020

Note:

The ALPHABET_SIZE constant has been increased to 512 because of the added Unicode support.
This should be enough for most users. A better alternative could be making ALPHABET_SIZE choosable by the user.

@dsu1995
Copy link

dsu1995 commented Feb 4, 2020

Thanks LEFTazs! Really appreciate your contribution. However, I do have a concern about breaking backwards compatibility, so this PR might still need some changes.

@LEFTazs
Copy link
Author

LEFTazs commented Feb 7, 2020

The code is now backwards compatible. All tests run without error.

@@ -1,10 +1,10 @@
#!python
# cython: language_level=3, boundscheck=False, wraparound=False, embedsignature=True, linetrace=True, c_string_type=str, c_string_encoding=ascii
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was CYTHON_TRACE_NOGIL removed?

@@ -132,11 +132,20 @@ cdef inline DTYPE_t row_insert_range_cost(

# End Array2D

cdef unsigned int* convert_string_to_int_array(unsigned char* str, Py_ssize_t size):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this function can be nogil if we get rid of enumerate and ord:

    cdef unsigned int* intarr = <unsigned int*> malloc(size * sizeof(unsigned int))
    for i in range(size):
        intarr[i] = str[i]
    return intarr

I don't think ord is necessary since in C unsigned char can be widened to unsigned int implicitly without loss of precision.


cdef inline unsigned char str_1_get(unsigned char* s, Py_ssize_t i) nogil:
cdef void copy_str_to_int_arr(unsigned char* str, Py_ssize_t len, unsigned int* int_arr) nogil:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both this function and convert_string_to_int_array?

@@ -179,24 +188,33 @@ def damerau_levenshtein(
if transpose_costs is None:
transpose_costs = unit_matrix

s1 = str(str1).encode()
s2 = str(str2).encode()
s1 = str1.encode('utf-8').decode('utf-8')
Copy link

@dsu1995 dsu1995 Feb 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe add a comment explaining what this part is doing? There are a lot of subtleties in Cython string implicit conversion (and frankly I can't really tell what this is doing either without digging through a lot of documentation).

Also it would be nice to create a helper function for string conversion that can be shared between lev, dam_lev, and osa. Maybe we can document what this does in that helper function.

Perhaps the helper function can take in the unsigned char* and return a tuple of (unsigned int*, Py_ssize_t) that returns the int array and length. See here for how to return a tuple in Cython.

@dsu1995
Copy link

dsu1995 commented Feb 9, 2020

Thank you for your hard work LEFTazs! I left a couple more minor comments, but I think we have addressed all the major issues. I think this PR will be in a good place after one or two more iterations.

Also just a heads up, I am not an owner of this repo, so I cannot approve this PR. However, I will try to reach out to the owners of this repo to get this merged, so that your hard work isn't wasted.

@pachewise
Copy link
Contributor

@LEFTazs, any chance you can make the changes requested? I can help merge this once this code is ready, but honestly not as well versed in this code, so @dsu1995 would be great if you can also review after changes are made.

@emanuelevivoli
Copy link

Any updates? I'm interested in having Unicode support, too.

@shotofcovfefe
Copy link

This thread has been stagnant for some time now, however, I'd really appreciate seeing it merged!

@PinguDEV-original
Copy link

Yeah please!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Jupyter notebook crashes when using dam_lev with transpose_costs Spezial characters
6 participants