feature request: provide Unicode positions #102

vmarkovtsev · 2019-03-20T15:45:25Z

As we have recently found out, the positions of UAST nodes must be measured in bytes, not in runes. However, we require working with strings, so we had to build the conversion ourselves. This code is not straightforward.
Besides, all kinds of crazy things may happen, e.g. the line number can change.

I suggest adding the function to recalculate positions to Babelfish, either as an API extension or to the clients. It is not cool to calculate them by hand, and they are different from IDE (expected) positions anyway.

bzz · 2019-03-20T16:37:47Z

Not sure how generally applicable is that across all the drivers, but I know @creachadair rised this concern before and @dennwc did some work on it e.g at js driver, in order to make this happen bblfsh/javascript-driver#51

dennwc · 2019-03-20T16:43:00Z

This was a reverse operation: Babelfish expects bytes positions, but JS was writing Unicode positions.

The feature request makes total sense, but I think we can't add Unicode positions as a field - it will make UAST files even larger.

So I propose to send the whole source file as a part of UAST. Then we can recalculate Unicode positions and even tokens on demand.

vmarkovtsev · 2019-03-20T17:06:49Z

It is also possible to add a parameter to the parse request that the client wants Unicode positions. Normally you don't need both at the same time.

dennwc · 2019-03-20T17:09:43Z

This makes sense. We can extend the parsing protocol with one more option for this. Still, the Position node will need some flag to signal that it's stored as Unicode position.

dennwc · 2019-05-15T15:07:12Z

As per discussion on Slack (non-public link), this sounds like a "wontfix" for the SDK:

@creachadair: The representation of position is tricky: Even languages that use Unicode internally do not necessarily agree on the distinction between codepoint offset and character. For different normalizations you may well get different offsets, and whether or not the normalization happens implicitly varies.

Byte offset is the only reliable universal baseline, and so the real question IMO is how to get codepoint offsets into the API. Personally, I think it makes sense to put the code for transforming a (bytes, byte-offset) pair into a codepoint-offset, line number, column offset, etc., in a common location. But I do not think it makes sense to plumb a bunch of switches through the APIs for this.

@vmarkovtsev: I wonder if it will help to force a specific normalization form in case Unicode positions are requested - that is, to return a different byte representation which does not have gotchas.

@creachadair: That might help in some cases, but not all: In some cases converting to a particular normal form requires lossy transformations. If you only need to go "forward" (e.g., read the file and answer questions about it) that may be OK. But if you want to apply suggestions or fixes to a file, the positions may no longer match after transformation.

Worse: After normalization the byte-for-byte encoding may change length, which means even the byte offsets get moved.

So: Yes, it's possible to force a specific encoding, lossily, but it's only safe in one direction.

And yes: Unicode is a mess.

vmarkovtsev · 2019-05-15T16:15:02Z

@dennwc Shall we move this request to the Python client then? As I wrote, somebody has to calculate the positions due to the "business logic" requirements, and I am not happy that this is currently the ML team's responsibility.

I mean, we can say that this is wontfix for clever and legitimate technical reasons (which happen in 0.01% of all the cases though), but then something will break on our side, and the product will be broken, too. cc @smola

dennwc · 2019-05-15T16:16:49Z

@vmarkovtsev I only meant the SDK. This should probably be in clients or libuast.

vmarkovtsev · 2019-05-15T16:19:01Z

BTW I am perfectly fine if the positions calculator sometimes responds that it cannot infer Unicode positions reliably (because of crazy normalization, etc.) and returns nil. We could simply skip such complex files at roughly no practical loss.

dennwc · 2019-05-15T16:20:48Z

The problem is only that we can't provide those positions in the UAST, but we can always calculate them separately. This way the user won't be able to change positions directly and we won't hit cases mentioned by @creachadair.

bzz · 2019-05-20T16:36:40Z

@dennwc @creachadair how do you think, what would be the best place to move this issue?
I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?

creachadair · 2019-05-20T18:16:11Z

@dennwc @creachadair how do you think, what would be the best place to move this issue?
I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?

That seems like a reasonable approach. As long as we're careful not to make the API too specific to Python, we should be able to shift it into the shared library later.

dennwc · 2019-05-21T00:04:30Z

This will be done in libuast for Python, so it makes sense to move the issue there. Doing it now.

dennwc · 2019-05-21T00:07:41Z

Not sure why, but I don't have permissions to move it to libuast specifically. Can move to other repositories, though.

@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!

creachadair · 2019-05-23T16:40:12Z

@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!

Done. It turns out libuast had Maintainers set to "Write" instead of "Admin". I have now fixed it, and they all three have "Admin" for that group.

Signed-off-by: Denys Smirnov <[email protected]>

dennwc self-assigned this May 21, 2019

bzz transferred this issue from bblfsh/sdk May 23, 2019

bzz mentioned this issue May 29, 2019

Node position offset is not bytes bblfsh/java-driver#122

Closed

dennwc mentioned this issue May 30, 2019

Add reverse conversions from UTF-8/16 offsets bblfsh/sdk#403

Merged

dennwc pushed a commit to dennwc/libuast that referenced this issue May 30, 2019

expose positional index to clients; addresses bblfsh#102

b44d9ee

Signed-off-by: Denys Smirnov <[email protected]>

dennwc mentioned this issue May 30, 2019

Expose positional index #103

Merged

dennwc pushed a commit to dennwc/libuast that referenced this issue Jun 3, 2019

expose positional index to clients; addresses bblfsh#102

ed8a93c

Signed-off-by: Denys Smirnov <[email protected]>

dennwc mentioned this issue Jun 5, 2019

Support conversion to Unicode line-column positions bblfsh/sdk#405

Merged

dennwc pushed a commit to dennwc/libuast that referenced this issue Jun 6, 2019

expose positional index to clients; addresses bblfsh#102

527e6bd

Signed-off-by: Denys Smirnov <[email protected]>

dennwc pushed a commit to dennwc/libuast that referenced this issue Jun 13, 2019

expose positional index to clients; addresses bblfsh#102

03e63dd

Signed-off-by: Denys Smirnov <[email protected]>

dennwc pushed a commit that referenced this issue Jun 20, 2019

expose positional index to clients; addresses #102

0999a6e

Signed-off-by: Denys Smirnov <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature request: provide Unicode positions #102

feature request: provide Unicode positions #102

vmarkovtsev commented Mar 20, 2019

bzz commented Mar 20, 2019 •

edited

Loading

dennwc commented Mar 20, 2019

vmarkovtsev commented Mar 20, 2019

dennwc commented Mar 20, 2019

dennwc commented May 15, 2019

vmarkovtsev commented May 15, 2019 •

edited

Loading

dennwc commented May 15, 2019

vmarkovtsev commented May 15, 2019

dennwc commented May 15, 2019

bzz commented May 20, 2019

creachadair commented May 20, 2019

dennwc commented May 21, 2019

dennwc commented May 21, 2019

creachadair commented May 23, 2019

feature request: provide Unicode positions #102

feature request: provide Unicode positions #102

Comments

vmarkovtsev commented Mar 20, 2019

bzz commented Mar 20, 2019 • edited Loading

dennwc commented Mar 20, 2019

vmarkovtsev commented Mar 20, 2019

dennwc commented Mar 20, 2019

dennwc commented May 15, 2019

vmarkovtsev commented May 15, 2019 • edited Loading

dennwc commented May 15, 2019

vmarkovtsev commented May 15, 2019

dennwc commented May 15, 2019

bzz commented May 20, 2019

creachadair commented May 20, 2019

dennwc commented May 21, 2019

dennwc commented May 21, 2019

creachadair commented May 23, 2019

bzz commented Mar 20, 2019 •

edited

Loading

vmarkovtsev commented May 15, 2019 •

edited

Loading