GUL -- Global Unicode Lock #13

Open
tiran opened this issue Jun 3, 2016 · 2 comments

tiran commented Jun 3, 2016

Although Python's str type is immutable from Python space, it is a C-mutable object. PyUnicodeObject has several writeable members. In fact, PyUnicodeObject's payload itself is writeable from C code when Py_REFCNT(o) == 1. @larryhastings and I agree that a per-object lock for str is too costly. Instead we would like to go with an optimistic global unicode lock.

Disclaimer: I don't fully understand the details of the current implementation and PEP 393.

https://www.python.org/dev/peps/pep-0393/#specification

Py_hash_t hash

The hash member caches the hash value for hash('somestring'). It is only computed on demand. Since the hash doesn't involve any separately allocated storage, no locking is required. At worst, two threads compute the same hash value and overwrite each other's result.
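To illustrate why the lazy hash needs no lock, here is a minimal sketch. Everything in it is hypothetical and not taken from CPython: the demo_hashed struct, the demo_hash function, and the FNV-1a stand-in for the real hash function. The only detail borrowed from the real object layout is the convention that -1 means "not computed yet". The load and store use relaxed C11 atomics so the sketch itself is free of data races; the issue does not say whether the real code would need that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a string object with a lazily cached hash. */
typedef struct {
    const char *data;       /* immutable payload */
    size_t length;
    _Atomic int64_t hash;   /* -1 means "not computed yet" */
} demo_hashed;

/* Stand-in hash (FNV-1a), NOT the hash CPython uses. */
static int64_t demo_fnv1a(const char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)p[i];
        h *= 1099511628211ULL;
    }
    /* Drop one bit so the result is non-negative and never equals -1. */
    return (int64_t)(h >> 1);
}

int64_t demo_hash(demo_hashed *s)
{
    assert(s != NULL);
    int64_t h = atomic_load_explicit(&s->hash, memory_order_relaxed);
    if (h != -1)
        return h;                       /* already cached */

    /* Several threads may reach this point at once. Each computes the
     * same value from the same immutable payload, so whichever store
     * lands last writes exactly what the others wrote. */
    h = demo_fnv1a(s->data, s->length);
    atomic_store_explicit(&s->hash, h, memory_order_relaxed);
    return h;
}
```

The race is harmless only because the value is a plain integer derived from immutable data. Nothing is allocated, so nothing can leak.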

writeable data members

  • wchar_t *wstr (PyASCIIObject)
  • char *utf8 (PyCompactUnicodeObject)
  • PyUnicodeObject.data

Write access to any C-mutable member that involves memory allocation must be synchronized by the GUL. Otherwise two threads may set the same pointer, which results in a memory leak of one of the allocated buffers. My gut feeling tells me that conflicts are scarce, so optimistic locking is going to perform better here. For the utf8 member, the sequence would be:

  • Check if utf8 member is already set
  • When utf8 member is not set, compute UTF-8 value
  • acquire GUL
  • Check again whether another thread has set the utf8 member in the meantime.
    • if utf8 member is still NULL, set member
    • if utf8 member has been set by another thread, discard and free UTF-8 value
  • release GUL
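
The steps above could look roughly like the following sketch. It is not code from the gilectomy branch. The demo_str struct, the gul mutex, demo_compute_utf8 and demo_str_as_utf8 are hypothetical names, and the "encoder" is just a copy of the payload. The sketch assumes POSIX threads for the lock and uses C11 atomics for the unlocked first check, which is an assumption on top of what the issue describes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a str object with a lazily filled utf8 cache. */
typedef struct {
    const char *payload;     /* canonical data, never changes */
    _Atomic(char *) utf8;    /* NULL until some thread sets it */
} demo_str;

/* Hypothetical global unicode lock (GUL). */
static pthread_mutex_t gul = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the real encoder: here the "encoding" is a plain copy. */
static char *demo_compute_utf8(const demo_str *s)
{
    size_t n = strlen(s->payload) + 1;
    char *buf = malloc(n);
    if (buf != NULL)
        memcpy(buf, s->payload, n);
    return buf;
}

const char *demo_str_as_utf8(demo_str *s)
{
    assert(s != NULL && s->payload != NULL);

    /* 1. Check without the lock whether utf8 is already set. */
    char *cached = atomic_load_explicit(&s->utf8, memory_order_acquire);
    if (cached != NULL)
        return cached;

    /* 2. Not set: compute the value outside the lock (optimistic part). */
    char *fresh = demo_compute_utf8(s);
    if (fresh == NULL)
        return NULL;                 /* allocation failed */

    /* 3. Acquire the GUL and check again. */
    pthread_mutex_lock(&gul);
    cached = atomic_load_explicit(&s->utf8, memory_order_relaxed);
    if (cached == NULL) {
        /* Still unset: publish our buffer. */
        atomic_store_explicit(&s->utf8, fresh, memory_order_release);
        cached = fresh;
        fresh = NULL;
    }
    pthread_mutex_unlock(&gul);

    /* 4. Another thread won the race: discard our copy (no-op otherwise). */
    free(fresh);
    return cached;
}
```

The expensive work happens outside the lock, so the GUL is held only for a pointer comparison and a store. The price is that a losing thread throws away a buffer it computed, which is acceptable if conflicts are as rare as assumed.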

special casing of Py_REFCNT() == 1

Python's str uses a special case to optimize string concatenation and in _PyUnicodeWriter. As far as I can tell, _PyUnicodeWriter requires the special case to work. I'm not yet sure how to handle this special case. I have been considering a new flag, constructable, which can be set if and only if a PyUnicodeObject is being used by C API calls in a single thread. struct state has 24 unused bits left.
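
For readers unfamiliar with the trick, here is a simplified model of a refcount-guarded in-place append. It is not CPython's implementation: demo_buf, demo_new and demo_concat are hypothetical, and the reference count is a plain integer with no thread safety at all. That is the point of the sketch: the refcnt == 1 test is only meaningful if no other thread can obtain a reference between the test and the write.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string buffer. */
typedef struct {
    long refcnt;
    size_t length;
    char *data;              /* NUL-terminated */
} demo_buf;

demo_buf *demo_new(const char *text)
{
    demo_buf *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->length = strlen(text);
    b->data = malloc(b->length + 1);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    memcpy(b->data, text, b->length + 1);
    b->refcnt = 1;
    return b;
}

/* Takes over the caller's reference to `left` and returns a reference to
 * the concatenation. On allocation failure returns NULL and leaves `left`
 * untouched. */
demo_buf *demo_concat(demo_buf *left, const char *right)
{
    assert(left != NULL && right != NULL);
    size_t rlen = strlen(right);

    if (left->refcnt == 1) {
        /* Sole owner: nobody else can observe the change, so grow and
         * mutate the existing object instead of copying it. */
        char *grown = realloc(left->data, left->length + rlen + 1);
        if (grown == NULL)
            return NULL;
        memcpy(grown + left->length, right, rlen + 1);
        left->data = grown;
        left->length += rlen;
        return left;
    }

    /* Shared: the object must look immutable, so build a new one. */
    demo_buf *out = malloc(sizeof *out);
    if (out == NULL)
        return NULL;
    out->data = malloc(left->length + rlen + 1);
    if (out->data == NULL) {
        free(out);
        return NULL;
    }
    memcpy(out->data, left->data, left->length);
    memcpy(out->data + left->length, right, rlen + 1);
    out->length = left->length + rlen;
    out->refcnt = 1;
    left->refcnt--;          /* give up the caller's reference */
    return out;
}
```

A flag such as the proposed constructable bit would presumably replace or supplement the bare refcnt == 1 test in the first branch.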

WIP branch

I started a branch but gave up after a couple of hours: https://github.com/tiran/gilectomy/tree/gul

@DemiMarie

Can we make PyUnicodeObject truly immutable? As I understand it, PyPy does just that.


tiran commented Nov 28, 2016

No, it's not easily possible. AFAIK PyPy does not implement the trick where a single-reference PyUnicodeObject is mutable as long as it has not escaped into Python space. CPython's PyUnicodeObject is mutable in more ways than that. For instance, each PyUnicodeObject can hold multiple optional representations of its data, e.g. an additional UTF-8 representation. That case is explained in the "writeable data members" section above. We can't get rid of the additional members without a major rewrite, API breakage and a performance decrease.
