Skip to content
This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

feature request: provide Unicode positions #102

Open
vmarkovtsev opened this issue Mar 20, 2019 · 14 comments
Open

feature request: provide Unicode positions #102

vmarkovtsev opened this issue Mar 20, 2019 · 14 comments
Assignees

Comments

@vmarkovtsev
Copy link
Contributor

As we have recently found out, the positions of UAST nodes must be measured in bytes, not in runes. However, we require working with strings, so we had to build the conversion ourselves. This code is not straightforward.
Besides, all kinds of crazy things may happen, e.g. the line number can change.

I suggest adding the function to recalculate positions to Babelfish, either as an API extension or to the clients. It is not cool to calculate them by hand, and they are different from IDE (expected) positions anyway.

@bzz
Copy link
Contributor

bzz commented Mar 20, 2019

Not sure how generally applicable is that across all the drivers, but I know @creachadair rised this concern before and @dennwc did some work on it e.g at js driver, in order to make this happen bblfsh/javascript-driver#51

@dennwc
Copy link
Member

dennwc commented Mar 20, 2019

This was a reverse operation: Babelfish expects bytes positions, but JS was writing Unicode positions.

The feature request makes total sense, but I think we can't add Unicode positions as a field - it will make UAST files even larger.

So I propose to send the whole source file as a part of UAST. Then we can recalculate Unicode positions and even tokens on demand.

@vmarkovtsev
Copy link
Contributor Author

It is also possible to add a parameter to the parse request that the client wants Unicode positions. Normally you don't need both at the same time.

@dennwc
Copy link
Member

dennwc commented Mar 20, 2019

This makes sense. We can extend the parsing protocol with one more option for this. Still, the Position node will need some flag to signal that it's stored as Unicode position.

@dennwc
Copy link
Member

dennwc commented May 15, 2019

As per discussion on Slack (non-public link), this sounds like a "wontfix" for the SDK:

@creachadair: The representation of position is tricky: Even languages that use Unicode internally do not necessarily agree on the distinction between codepoint offset and character. For different normalizations you may well get different offsets, and whether or not the normalization happens implicitly varies.

Byte offset is the only reliable universal baseline, and so the real question IMO is how to get codepoint offsets into the API. Personally, I think it makes sense to put the code for transforming a (bytes, byte-offset) pair into a codepoint-offset, line number, column offset, etc., in a common location. But I do not think it makes sense to plumb a bunch of switches through the APIs for this.

@vmarkovtsev: I wonder if it will help to force a specific normalization form in case Unicode positions are requested - that is, to return a different byte representation which does not have gotchas.

@creachadair: That might help in some cases, but not all: In some cases converting to a particular normal form requires lossy transformations. If you only need to go "forward" (e.g., read the file and answer questions about it) that may be OK. But if you want to apply suggestions or fixes to a file, the positions may no longer match after transformation.

Worse: After normalization the byte-for-byte encoding may change length, which means even the byte offsets get moved.

So: Yes, it's possible to force a specific encoding, lossily, but it's only safe in one direction.

And yes: Unicode is a mess.

@vmarkovtsev
Copy link
Contributor Author

vmarkovtsev commented May 15, 2019

@dennwc Shall we move this request to the Python client then? As I wrote, somebody has to calculate the positions due to the "business logic" requirements, and I am not happy that this is currently the ML team's responsibility.

I mean, we can say that this is wontfix for clever and legitimate technical reasons (which happen in 0.01% of all the cases though), but then something will break on our side, and the product will be broken, too. cc @smola

@dennwc
Copy link
Member

dennwc commented May 15, 2019

@vmarkovtsev I only meant the SDK. This should probably be in clients or libuast.

@vmarkovtsev
Copy link
Contributor Author

BTW I am perfectly fine if the positions calculator sometimes responds that it cannot infer Unicode positions reliably (because of crazy normalization, etc.) and returns nil. We could simply skip such complex files at roughly no practical loss.

@dennwc
Copy link
Member

dennwc commented May 15, 2019

The problem is only that we can't provide those positions in the UAST, but we can always calculate them separately. This way the user won't be able to change positions directly and we won't hit cases mentioned by @creachadair.

@bzz
Copy link
Contributor

bzz commented May 20, 2019

@dennwc @creachadair how do you think, what would be the best place to move this issue?
I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?

@creachadair
Copy link
Contributor

@dennwc @creachadair how do you think, what would be the best place to move this issue?
I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?

That seems like a reasonable approach. As long as we're careful not to make the API too specific to Python, we should be able to shift it into the shared library later.

@dennwc
Copy link
Member

dennwc commented May 21, 2019

This will be done in libuast for Python, so it makes sense to move the issue there. Doing it now.

@dennwc dennwc self-assigned this May 21, 2019
@dennwc
Copy link
Member

dennwc commented May 21, 2019

Not sure why, but I don't have permissions to move it to libuast specifically. Can move to other repositories, though.

@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!

@bzz bzz transferred this issue from bblfsh/sdk May 23, 2019
@creachadair
Copy link
Contributor

@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!

Done. It turns out libuast had Maintainers set to "Write" instead of "Admin". I have now fixed it, and they all three have "Admin" for that group.

dennwc pushed a commit to dennwc/libuast that referenced this issue May 30, 2019
dennwc pushed a commit to dennwc/libuast that referenced this issue Jun 3, 2019
dennwc pushed a commit to dennwc/libuast that referenced this issue Jun 6, 2019
dennwc pushed a commit to dennwc/libuast that referenced this issue Jun 13, 2019
dennwc pushed a commit that referenced this issue Jun 20, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants