Identifying unexpected token error positions #7

Open
SeanDS opened this issue Mar 5, 2021 · 4 comments

@SeanDS

SeanDS commented Mar 5, 2021

I hope it's ok to ask a general question about pegen here. I've built a parser following the blog posts and code here, and it generally works really nicely but one thing I found missing in the blog is how to handle unexpected tokens. As far as I understand, the recursive descent parser will continue to backtrack on unexpected input until it reaches the first rule again, unless you define an explicit rule to handle particular errors. I assume there is some strategy to identify the token that caused the error, like with how Python's parser knows the error in the following line is the *:

>>> 1*
  File "<stdin>", line 1
    1*
     ^
SyntaxError: invalid syntax

How/where is pegen handling this sort of error? Or, if pegen doesn't handle this error, where is Python's parser handling it (since I know it does!)?

Aside: while trying a different invalid Python syntax example I got something unexpected:

$ echo "hi(" -n | python -

With 3.9.2 this gives the error

  File "<stdin>", line 2
    
    ^
SyntaxError: unexpected EOF while parsing

which seems to be not showing the line with the error (but it's marking it). The same behaviour happens when hi( is in a file. Is this a bug with Python's new parser's error handling (and if so, should I report it)?

@pablogsal
Contributor

> How/where is pegen handling this sort of error? Or, if pegen doesn't handle this error, where is Python's parser handling it (since I know it does!)?

We inject invalid rules and we have a mechanism in the C parser to abort the backtracking. For instance, check:

https://github.com/python/cpython/blob/master/Grammar/python.gram#L106
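To illustrate the idea of an injected invalid rule, here is a minimal sketch of a toy parser, not pegen's actual code. The class `ToyParser`, the grammar (`NAME = NUMBER`), and the helper `parse_or_message` are all hypothetical names made up for this example. A Python exception stands in for the mechanism that aborts backtracking; how the C parser implements that internally is not shown here.

```python
from typing import Callable, List, Optional, Tuple


class ToyParser:
    """Parses the toy statement `NAME = NUMBER` from a pre-split token list."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def match(self, predicate: Callable[[str], bool]) -> Optional[str]:
        # Consume and return the next token if it satisfies the predicate.
        if self.pos < len(self.tokens) and predicate(self.tokens[self.pos]):
            self.pos += 1
            return self.tokens[self.pos - 1]
        return None

    def assignment(self) -> Optional[Tuple[str, int]]:
        start = self.pos
        # Normal alternative: NAME '=' NUMBER
        name = self.match(str.isidentifier)
        if name is not None and self.match(lambda t: t == "=") is not None:
            value = self.match(str.isdigit)
            if value is not None and self.pos == len(self.tokens):
                return (name, int(value))
        self.pos = start  # ordinary failure: backtrack silently

        # "Invalid" alternative: NUMBER '=' matches a known mistake, so
        # raise immediately instead of backtracking any further.
        if self.match(str.isdigit) is not None and self.match(lambda t: t == "=") is not None:
            raise SyntaxError("cannot assign to literal")
        self.pos = start
        return None


def parse_or_message(text: str) -> str:
    # Hypothetical convenience wrapper that reports the outcome as a string.
    try:
        result = ToyParser(text.split()).assignment()
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"
    return repr(result)
```

With this sketch, `parse_or_message("x = 1")` reports the parsed pair, `parse_or_message("1 = x")` reports the specific error raised by the invalid alternative, and `parse_or_message("x x")` reports `None` because no alternative, valid or invalid, matched.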

> which seems to be not showing the line with the error (but it's marking it). The same behaviour happens when hi( is in a file. Is this a bug with Python's new parser's error handling (and if so, should I report it)?

That is a tokenizer error: the tokenizer reached the end of the source while still expecting more tokens. This error has been improved in Python 3.10 (when it is possible to retain the source, such as when using "-c"):

 ./python.exe -c "hi("
  File "<string>", line 1
    hi(
      ^
SyntaxError: '(' was never closed
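As a rough sketch of why retaining the source matters for this kind of message, the following keeps a stack of open brackets with their positions and reports the innermost one left open at the end of input. It is an illustration only, not CPython's tokenizer: the names `find_unclosed` and `describe` are hypothetical, and the sketch ignores strings, comments, and mismatched closers.

```python
from typing import List, Optional, Tuple

CLOSERS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(CLOSERS.values())


def find_unclosed(source: str) -> Optional[Tuple[str, int, int]]:
    """Return (bracket, line, column) of the innermost unclosed bracket, if any."""
    stack: List[Tuple[str, int, int]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in OPENERS:
                stack.append((ch, lineno, col))
            elif ch in CLOSERS and stack and stack[-1][0] == CLOSERS[ch]:
                stack.pop()
    return stack[-1] if stack else None


def describe(source: str) -> str:
    """Format a caret-style report; this needs the original text of the line."""
    unclosed = find_unclosed(source)
    if unclosed is None:
        return "ok"
    bracket, lineno, col = unclosed
    line = source.splitlines()[lineno - 1]
    caret = " " * (col - 1) + "^"
    return f"line {lineno}\n    {line}\n    {caret}\nSyntaxError: '{bracket}' was never closed"
```

Calling `print(describe("hi("))` prints the offending line with a caret under the bracket. Note that `describe` has to look the line up again in `source`, which is only possible when the full text is still available.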

@gvanrossum

Without special error rules you can still do a decent job. Just make the error point at the last token read (assuming your tokenizer is “lazy”, i.e. it only tokenizes as far as the parser needs). In most cases this gives adequate errors.
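A minimal sketch of this approach, assuming a small arithmetic grammar invented for the example: the tokenizer is a generator, the token buffer only grows when the parser peeks, and on failure the error points at the last token in the buffer. The names `LazyTokens`, `Parser`, and `error_column` are hypothetical and are not pegen's API.

```python
import re
from typing import Iterator, List, NamedTuple, Optional

TOKEN_RE = re.compile(r"\d+|[-+*/()]")


class Token(NamedTuple):
    kind: str  # "NUMBER", "OP" or "END"
    text: str
    col: int   # 0-based column


def tokenize(source: str) -> Iterator[Token]:
    # Generator: nothing is scanned until the parser asks for a token.
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError("invalid character", ("<input>", 1, pos + 1, source))
        kind = "NUMBER" if match.group().isdigit() else "OP"
        yield Token(kind, match.group(), pos)
        pos = match.end()
    yield Token("END", "", len(source))


class LazyTokens:
    """Token buffer that supports backtracking and grows only on demand."""

    def __init__(self, source: str) -> None:
        self._gen = tokenize(source)
        self._tokens: List[Token] = []
        self.pos = 0

    def peek(self) -> Token:
        while len(self._tokens) <= self.pos:
            self._tokens.append(next(self._gen))
        return self._tokens[self.pos]

    def last_read(self) -> Token:
        # The furthest token the parser ever looked at.
        return self._tokens[-1]


class Parser:
    # Grammar: start: expr END; expr: term (('+'|'-') term)*;
    #          term: atom (('*'|'/') atom)*; atom: NUMBER | '(' expr ')'
    def __init__(self, source: str) -> None:
        self.source = source
        self.tokens = LazyTokens(source)

    def expect(self, kind: str, text: Optional[str] = None) -> bool:
        tok = self.tokens.peek()
        if tok.kind == kind and (text is None or tok.text == text):
            self.tokens.pos += 1
            return True
        return False

    def atom(self) -> bool:
        start = self.tokens.pos
        if self.expect("NUMBER"):
            return True
        if self.expect("OP", "(") and self.expr() and self.expect("OP", ")"):
            return True
        self.tokens.pos = start  # backtrack
        return False

    def term(self) -> bool:
        if not self.atom():
            return False
        while True:
            start = self.tokens.pos
            if (self.expect("OP", "*") or self.expect("OP", "/")) and self.atom():
                continue
            self.tokens.pos = start  # backtrack over a dangling operator
            return True

    def expr(self) -> bool:
        if not self.term():
            return False
        while True:
            start = self.tokens.pos
            if (self.expect("OP", "+") or self.expect("OP", "-")) and self.term():
                continue
            self.tokens.pos = start
            return True

    def parse(self) -> None:
        if self.expr() and self.tokens.peek().kind == "END":
            return
        tok = self.tokens.last_read()
        raise SyntaxError("invalid syntax", ("<input>", 1, tok.col + 1, self.source))


def error_column(source: str) -> Optional[int]:
    """Return the 0-based error column, or None if the input parses."""
    try:
        Parser(source).parse()
    except SyntaxError as exc:
        return exc.offset - 1
    return None
```

Even though the parser backtracks all the way to the start rule, the buffer remembers how far it got. For the input `1 2` the second number is the last token ever requested, so `error_column("1 2")` points at it and the end-of-input token is never produced.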

@SeanDS
Author

SeanDS commented Mar 5, 2021

@pablogsal, I was rather looking for information on how the parser handles cases where there isn't a special error rule.
But thanks for the input on the tokeniser error, and glad to hear it's apparently fixed in 3.10.

@gvanrossum Thanks, that works. Earlier I found the part of the code in the repo here that is doing what you say, and I just finished getting it to work for my parser - seems to do the job!

I really enjoyed reading the blogs BTW. If I hadn't found them I'd still be hacking away at my project's old LALR(1) parser.

@pablogsal
Contributor

pablogsal commented Mar 5, 2021

> and glad to hear it's apparently fixed in 3.10.

It is not technically "fixed" but "improved". Notice the old error is still correct: there was an unexpected end-of-file token while parsing. With our parser, it is not always easy to emit the improved version because we don't always have all the text that we parsed (for instance, when reading from stdin).
