Skip to content
This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

Fix other escape strings from JS #81

Merged
merged 5 commits into from
Apr 18, 2019
Merged

Conversation

bzz
Copy link
Contributor

@bzz bzz commented Mar 29, 2019

Fixes the rest of the #75 - case of check: key "value": invalid syntax ("\ ")"

Details of the applied solution are described in #81 (comment)

@bzz bzz self-assigned this Mar 29, 2019
Copy link
Contributor

@creachadair creachadair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please expand the description before merging to say what cases this PR is fixing?

This looks fine so far, but since it's marked WIP I'll hold off on approving till you're ready.

driver/normalizer/strconv_test.go Outdated Show resolved Hide resolved
fixtures/string-literal.js Outdated Show resolved Hide resolved
@bzz
Copy link
Contributor Author

bzz commented Apr 2, 2019

Rebased on laster master and early feedback amended.

Right now it's still WIP as it only contains the reproducible example + tests but no a fix yet.

@bzz
Copy link
Contributor Author

bzz commented Apr 9, 2019

rebased on latest master

@bzz
Copy link
Contributor Author

bzz commented Apr 9, 2019

As we now have discovered many edge cases where escape sequence handing in the native driver is very different from Go - current approach from #64 does not seem to scale well, so a proposed solution that we discussed with @dennwc yesterday is:

  • in native/annotated mode: always take un-normalized (or quoted) value as token (includes starting \", \' and \n, etc in content)
  • in semantic mode, always take normalized (or unquoted) value as token (does not include those characters)

That would allow us to avoid re-implementing complex JS escape sequence handling on the Go side.

Although this seems to be consistent with initial intent #32 (comment) and seems to be the case for other drivers, I'm not sure how would that affect client's expectations (e.g see #32 and esp bblfsh/sdk#291) and want all of us to be on board, so \cc @creachadair and @dennwc

@creachadair
Copy link
Contributor

  • in native/annotated mode: always take un-normalized value as token (includes \", \', etc)
  • in semantic mode, always take normalized value as toke (does not include such characters)

That would allow us to avoid re-implementing complex JS escape sequence handling on the Go side.

That seems reasonable to me. I also like about it that its easy to understand which to expect: If you're upstream of canonicalization, you get the ambient encoding, downstream you get the normalized form.

Does this mean, though, that we will have to explicitly define the string encoding rules we use inside the canonical AST? (I didn't find it anywhere, and it looks like right now we implicitly assume Go's rules).

@bzz
Copy link
Contributor Author

bzz commented Apr 10, 2019

I guess that is something I'm trying to understand here as well.

After a series of experiments, I learned that go, java, bash, cpp, csharp are the only drivers where the original quotas used for literals are preserved in annotated mode.

php and python do not preserve that information and thus 'b\ncq' shows up as "@token":"b\ncq","@type":"Str"

@bzz
Copy link
Contributor Author

bzz commented Apr 10, 2019

The format below is:

string literal
and it's represenation in

annotated
semantc

modes.


go

"b\nc"

"@token":"\"b\\nc\""
"@type":"uast:String","Format":"","Value":"b\nc"

java

"b\nc"

"@token":"\"b\\nc\""
"@type":"uast:String","Format":"","Value":"b\nc"

javascript

"b\nc"

"@token":"\"b\\nc\""
"@type":"uast:String","Format":"","Value":"b\nc"

'b\ncq'

"@token":"'b\\ncq'"
"@type":"uast:String","Format":"single","Value":"b\ncq"

bash

"b\nc"

"@token":"\"b\\nc\"","@type":"string","children":["@role":["Expression","String","Literal"],"@token":"b\\nc","@type":"string_content"]
"@type":"uast:String","Format":"","Value":"b\\nc"

'b\ncq'

"@token":"'b\\ncq'","@type":"unevaluated_string2","children":[]
"@type":"uast:String","Format":"","Value":"'b\\ncq'"

cpp

"b\nc" (not sure yet, why is it twice :/ steps to reproduce in details)

"@token":"\"b\\nc\"","@type":"CPPASTLiteralExpression","ExpressionType":"const char [4]","ExpressionValueCategory":"LVALUE","IsLValue":true,"kind":"string_literal"
"@type":"uast:String","Format":"","Value":"b\nc"

"@token":"\"b\\nc\"","@type":"CPPASTLiteralExpression","ExpressionType":"const char [4]","ExpressionValueCategory":"LVALUE","IsLValue":true,"kind":"string_literal"
"@type":"uast:String","Format":"","Value":"b\nc"
const char* b = 'b';
const char* bc = "b\nc";
 bblfsh-cli -o json -l cpp -m annotated \
    -q "//*[@role='String' and @role='Literal']" cpp.cpp \
    | jq -c '.[] | del(."@pos") | del(.Token."@pos") | del(."@role")'

 bblfsh-cli -o json -l cpp -m semantic \
    -q '//uast:String' cpp.cpp \
   | jq -c '.[] | del(."@pos")'

csharp

"b\nc"

"@type":"StringLiteralExpression","IsMissing":false,"IsStructuredTrivia":false,"Token":{"@role":["Literal","String"],"@type":"StringLiteralToken","IsMissing":false,"Text":"\"b\\nc\"","Value":"b\nc","ValueText":"b\nc"}

"@type":"uast:String","Format":"","Value":"b\nc"

php

"b\nc"

"@token":"b\nc","@type":"Scalar_String","attributes":{"kind":2}
"@type":"uast:String","Format":"","Value":"b\nc"

'b\ncq'

"@token":"b\\ncq","@type":"Scalar_String","attributes":{"kind":1}
"@type":"uast:String","Format":"raw","Value":"b\\ncq"

python

"b\nc"

"@token":"b\nc","@type":"Str"
"@type":"uast:String","Format":"","Value":"b\nc"

'b\ncq'

"@token":"b\ncq","@type":"Str"
"@type":"uast:String","Format":"","Value":"b\ncq"

@dennwc
Copy link
Member

dennwc commented Apr 10, 2019

  • Semantic in Bash is incorrect (should be b\ncq)
  • Annotated in PHP and Python is incorrect (should be \"b\\ncq\" and 'b\\ncq')

Everything else looks good.

In Annotated the string should match the source file (including quotes). In Semantic there should be no quotes and the string should be unescaped according to the language rules (same as if you print it from that language).

@bzz
Copy link
Contributor Author

bzz commented Apr 11, 2019

I'm using this as kind-of umbrella issue for these type of problems, so will create new appropriate issues for Bash, PHP and Python were created.

So, the same approach will be used as solution for:

The proposed solution:

  • keep annotated mode as it is now
  • in semantic mode, instead of unquoting on Go side the value from annotated mode (leads to these bugs), we'll just use a different AST node that already provides a normalized by a native driver string, according the language specific rules.

bzz added 2 commits April 15, 2019 15:50
Fixes bblfsh#75

Annotated:
 - New field added to the string literals `normValue`
   It is used as value only in Semnatic mode.

Explicit Go-side strconv between annotated and semantic
is not neeed this way.

Signed-off-by: Alexander Bezzubov <[email protected]>
@bzz bzz changed the title [WIP] Fix other escape strings from JS Fix other escape strings from JS Apr 15, 2019
@bzz bzz requested review from dennwc and creachadair April 15, 2019 13:54
@bzz
Copy link
Contributor Author

bzz commented Apr 15, 2019

Fix applied in 683324b :

  • in Annotated, rawValue is now preserved as normValue (contains normalised string, previously discarded)
  • in Semantic, format: single is kept the same way but normValue is used instead of manual unquoting

All fixtures updated in fbed224.
@dennwc @creachadair ci is great and it's ready for review now.

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change required but looks good otherwise.

driver/normalizer/normalizer.go Outdated Show resolved Hide resolved
@bzz
Copy link
Contributor Author

bzz commented Apr 17, 2019

All feedback addressed in fc57c53 - could you please give it a final glance, @dennwc ?

Copy link
Member

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly what I had in mind, thanks!

driver/normalizer/normalizer.go Outdated Show resolved Hide resolved
@bzz
Copy link
Contributor Author

bzz commented Apr 18, 2019

Thank you every one for reviews! Merging after CI is green

@bzz bzz merged commit 9470950 into bblfsh:master Apr 18, 2019
@bzz bzz deleted the fix/other-escapes-75 branch April 18, 2019 11:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants