wip: mdoc reader #10225
base: main
I'll try to start doing real commits for myself from now on
Replacing spacetab copied from Roff lexer
mandoc's roff(7) says "Blank text lines, which may include whitespace, are only permitted within literal contexts." mandoc -T lint warns about blank lines and inserts a roff `sp` request, which is handled differently depending on the output format. My read is that mandoc considers the handling of a blank line in non-literal context in mdoc(7) to be undefined.
Copy-pasted. Maybe they'll come back.
See mdoc(7) section MACRO SYNTAX
This will handle Ns in the future
There are a number of unique-looking cases for Fl parsing, so I am just handling them very explicitly instead of trying to generalize anything enough to cover them all.
Solves a delta with mandoc
The edge case of "Ap (" tested in this mandoc regress isn't present in any actual OpenBSD base system manuals, where Ap is only ever followed by a letter. Furthermore, "Ap" is generally uncommon compared to "Ns '" (e.g. ".Xr mandoc 1 Ns 's"). I'm accepting a difference from mandoc here because correctly suppressing the space after the "(" would require more refactoring than I feel like doing at the time of writing.
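To illustrate the spacing behaviour discussed above, here is a minimal sketch of how a space-suppression marker such as the one produced by Ns could be applied when joining already-rendered inline pieces. The `Piece` type and `joinPieces` function are hypothetical names invented for this example and are not taken from the PR; the special handling of opening delimiters such as "(" is deliberately left out.

```haskell
-- Hypothetical intermediate form: rendered inline pieces plus a marker
-- that cancels the space which would otherwise be inserted at that point.
data Piece = Txt String | NoSpace
  deriving (Eq, Show)

-- Join pieces with single spaces, except where a NoSpace marker intervenes.
joinPieces :: [Piece] -> String
joinPieces = go False
  where
    -- The Bool records whether a space is due before the next text piece.
    go _ [] = ""
    go _ (NoSpace : rest) = go False rest
    go spaceDue (Txt t : rest) =
      (if spaceDue then " " else "") ++ t ++ go True rest

main :: IO ()
main = do
  -- Roughly what ".Xr mandoc 1 Ns 's" should end up as once Xr is rendered.
  putStrLn (joinPieces [Txt "mandoc(1)", NoSpace, Txt "'s"])
  -- Without the marker, the pieces are separated by a space.
  putStrLn (joinPieces [Txt "mandoc(1)", Txt "'s"])
```

The first call joins the possessive suffix directly onto the cross-reference, while the second leaves a space between them.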
It ends up with bad results in the ANSI writer, for example, because it then can't break lines at Spaces. This isn't wholly inconsistent with mandoc, because it makes no effort to render multiple consecutive spaces from the source document in HTML.
Getting to the point where I can start working with real manual pages so this is helpful.
A bit janky but worse things have happened.
Was having probalos
executable lexroff
  import: common-executable
  main-is: lexroff.hs
  build-depends: pandoc, text
Private test thingy, need to zap this.
@@ -547,6 +548,7 @@ library
  hs-source-dirs: src

  exposed-modules: Text.Pandoc,
                   Text.Pandoc.Readers.Roff,
Exposed this for my lexroff.hs thing, which I didn't commit. Move back to other-modules before merge.
I have only had a very cursory look, but one question that comes to mind is why you have a new lexer with a new kind of token. Is |
Part of it was just that I wanted to figure out how to implement this without having to kitbash the Roff lexer beyond recognition or keep the Man reader in sync with stuff that I changed, though I did end up extracting and reusing the escape sequences. But in a few ways the needs are fairly different. The token type used by

While the mdoc format inherits the superficial elements of roff syntax and in GNU groff is still implemented as a package of roff macros, mdoc macros themselves have a more complicated syntax. See MACRO SYNTAX in mandoc's mdoc(7) manual. The upshot is that the arguments to many macros are themselves parsed for macro calls, and in turn many macros can be called in argument position. (Cf. "Callable"/"Parsed" attributes of each macro.) So the

For example:

    .Sy hello Em world

I lex this as

    parseSy = do
      macro "Sy"
      args <- manyTill lit (anyMacro <|> eol)
      return $ strong $ mconcat $ intersperse space (map toString args)
    parseEm = do
      macro "Em"
      args <- manyTill lit (anyMacro <|> eol)
      return $ emph $ mconcat $ intersperse space (map toString args)

If my token stream were of the existing

    roffTokenToMdocTokens (ControlLine nm args) = Macro nm : map litOrMacro args <> [Eol]
      where
        litOrMacro x | isParsedMacro nm && isCallableMacro x = Macro x
                     | otherwise = Lit x

But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach. The following two lines will get the same lex from current

    .Sy hello Em world
    .Sy hello \&Em world

All of the above leaving aside the handling of delimiters required by

Finally, the

The bottom line of all this is that RoffToken and MdocToken are pretty different because the associated readers need different information from each control line. But all that being said, I guess it's plausible to at least base the lexers on some shared code by expanding on my (misnamed)
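The information-loss argument above can be made concrete with a small self-contained sketch. Everything here is an assumption made for illustration: the `Tok` type is a simplified stand-in for the PR's token type, the macro tables contain only `Sy` and `Em`, and `unescape` handles only the zero-width `\&` escape rather than the full roff escape set. The sketch compares classifying arguments before resolving escapes with classifying them afterwards.

```haskell
-- Simplified, hypothetical token type; the real one carries more detail.
data Tok = Macro String | Lit String | Eol
  deriving (Eq, Show)

-- Tiny illustrative subsets, not the full tables from mdoc(7).
isParsedMacro, isCallableMacro :: String -> Bool
isParsedMacro   = (`elem` ["Sy", "Em"])
isCallableMacro = (`elem` ["Sy", "Em"])

-- Drop the zero-width escape \& wherever it occurs.
-- Real roff has many more escapes; this is only enough for the example.
unescape :: String -> String
unescape ('\\' : '&' : rest) = unescape rest
unescape (c : rest)          = c : unescape rest
unescape []                  = []

-- Classify each raw argument first, then resolve escapes in the literals.
lexRaw :: String -> [String] -> [Tok]
lexRaw nm args = Macro nm : map classify args ++ [Eol]
  where
    classify raw
      | isParsedMacro nm && isCallableMacro raw = Macro raw
      | otherwise                               = Lit (unescape raw)

-- Resolve escapes first, then classify: the \& guard is already gone,
-- so an escaped "Em" can no longer be told apart from a macro call.
lexUnescaped :: String -> [String] -> [Tok]
lexUnescaped nm args = lexRaw nm (map unescape args)

main :: IO ()
main = do
  print (lexRaw "Sy" ["hello", "Em", "world"])
  print (lexRaw "Sy" ["hello", "\\&Em", "world"])
  print (lexUnescaped "Sy" ["hello", "\\&Em", "world"])
```

With `lexRaw`, the escaped argument stays a literal while the bare `Em` becomes a macro token. With `lexUnescaped`, both input lines produce the same token list, which is the problem described for a lexer that applies escapes before the mdoc reader sees the arguments.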
I'd like to understand this better. I would have thought that low level roff stuff like escapes was common currency for man, ms, and mdoc. Can you explain further why we can't handle the escapes in the lexer as we were doing? |
We do continue to handle the escapes in the lexer, and I'm reusing all the escaping code from

    .Sy hello Em world
    .Sy hello \&Em world

The
So if we wanted to reuse the |
There's a substantial amount of work left to do here, but as I am going on vacation for four weeks on Monday and not bringing a computer with me, it seems reasonable to put up a draft PR. I welcome feedback on what I've done so far.
closes #9056