Using entities in XML and MathML #2144

Open
Omikhleia opened this issue Oct 20, 2024 · 1 comment
Labels
enhancement Software improvement or feature request inputters:xml

Comments

@Omikhleia
Member

Omikhleia commented Oct 20, 2024

This relates to MathML, but raises some more interesting points regarding the "general" parsing of XML (#2111)...

Context

MathML in SIL-XML, with formula obtained from an external source...

<document>
<mathml mode="display">
  <mrow>
    <mrow>
      <mo>&lang;</mo>
      <mrow>
        <mi>&psi;</mi>
        <mspace width="0.17em" />
        <mrow>
....

SILE (well, actually our lxp parser) fails with: ! undefined entity at math-showcase/mathml/joe10.xml

Workarounds

How could we support MathML formulas that use HTML/MathML entities? The list of these is quite big. Did I say "big"? (The latter list even has a discussion on phi / varphi etc.)

So...

  1. One can replace all (HTML) entities in the original MathML file, either by their symbol or by their &#xXXXX; code point... but it's cumbersome and tedious in any reasonable workflow.
  2. One can hack the inputter so as to search-and-replace entities before XML parsing... but it's crazy performance-wise and sounds rather dumb (having to substitute strings in a whole document, before parsing it? No way!)
  3. One can add a DOCTYPE to the document, such as:
    <!DOCTYPE document [
      <!ENTITY times "&#x00D7;">
      <!ENTITY lang "&#x27E8;">
      <!ENTITY psi "&#x03C8;">
     ...
    ]>
    
    ... But it's also crazy and cumbersome.
  4. One can hack the inputter to stuff that big DTD automatically at the top of the content before parsing... But that's not ideal either, performance-wise (having lua-expat parse the same in-text DTD again and again...)
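
For illustration only, workaround 4 could be sketched roughly as below in plain Lua. The names `entities`, `buildDoctype` and `withDoctype` are hypothetical (not existing SILE functions), the table only holds the three entities from the example above, and the DOCTYPE detection is deliberately naive (a plain substring search).

```lua
-- Hypothetical entity table: name -> Unicode code point.
local entities = {
   times = 0x00D7,
   lang = 0x27E8,
   psi = 0x03C8,
}

-- Build an internal DTD subset declaring each entity as a numeric character reference.
local function buildDoctype (rootName, map)
   local names = {}
   for name in pairs(map) do
      names[#names + 1] = name
   end
   table.sort(names) -- deterministic output order
   local lines = { "<!DOCTYPE " .. rootName .. " [" }
   for _, name in ipairs(names) do
      lines[#lines + 1] = string.format('  <!ENTITY %s "&#x%04X;">', name, map[name])
   end
   lines[#lines + 1] = "]>"
   return table.concat(lines, "\n")
end

-- Add the generated DOCTYPE unless the document already seems to have one.
-- A leading "<?xml ...?>" declaration must stay first, so insert after it.
local function withDoctype (doc, rootName, map)
   if doc:find("<!DOCTYPE", 1, true) then
      return doc
   end
   local doctype = buildDoctype(rootName, map)
   local decl, rest = doc:match("^(<%?xml.-%?>)(.*)$")
   if decl then
      return decl .. "\n" .. doctype .. rest
   end
   return doctype .. "\n" .. doc
end
```

This makes the cost visible: the whole generated block is re-parsed for every document, which is exactly the performance concern raised in item 4.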

A real solution?

  • Add <!DOCTYPE document SYSTEM "sil.dtd"> at the top of the content, if it's absent... Users might even have a customized one:

    <!DOCTYPE document SYSTEM "sil.dtd" [
      <!ENTITY resilient "re·sil·ient"><!-- I'm so lazy -->
     ...
    ]>
    
  • And use a modified XML parser...

    local function parse (doc)
       local content = {
          StartElement = startcommand,
          EndElement = endcommand,
          CharacterData = text,
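          -- Called for entity references that expat could not resolve from a DTD
          SkippedEntity = function(parser, name, isParameter)
             local msg = MyAweSomeMappingOfEntities[name] or SU.error("Unknown entity: " .. name)
             text(parser, msg)
          end,
          -- Returning true lets parsing continue for a non-standalone document
          NotStandalone = function(parser)
             return true
          end,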
          _nonstrict = true,
          stack = { {} },
       }
       local parser = lxp.new(content)
    

The key point here is to enforce NotStandalone, and provide a SkippedEntity handler that does the replacements with a table... Extensible, flexible, clever performance-wise, and still allowing explicit DTD entity declaration as override.
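
To make the table side of that handler a bit more concrete, here is a minimal sketch in plain Lua, without lxp. Everything in it is an assumption for illustration: `entityCodepoints`, `utf8char` and `makeSkippedEntityHandler` are hypothetical names, the table only holds three entries, and plain `error` stands in for `SU.error`. The UTF-8 encoder is written by hand so that the sketch does not depend on the `utf8` library, which only exists in Lua 5.3 and later.

```lua
-- Hypothetical mapping: entity name -> Unicode code point.
local entityCodepoints = {
   times = 0x00D7,
   lang = 0x27E8,
   psi = 0x03C8,
}

-- Encode a single code point as a UTF-8 string.
local function utf8char (cp)
   local floor = math.floor
   if cp < 0x80 then
      return string.char(cp)
   elseif cp < 0x800 then
      return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
   elseif cp < 0x10000 then
      return string.char(0xE0 + floor(cp / 0x1000), 0x80 + floor(cp / 0x40) % 0x40, 0x80 + cp % 0x40)
   end
   return string.char(
      0xF0 + floor(cp / 0x40000),
      0x80 + floor(cp / 0x1000) % 0x40,
      0x80 + floor(cp / 0x40) % 0x40,
      0x80 + cp % 0x40
   )
end

-- Return a function with the (parser, name) shape used by the SkippedEntity
-- callback above. `emit` plays the role of the `text` character-data handler.
local function makeSkippedEntityHandler (emit, map)
   return function (parser, name)
      local cp = map[name]
      if not cp then
         error("Unknown entity: " .. name)
      end
      emit(parser, utf8char(cp))
   end
end

-- Usage with a stub emitter that just collects the replacement text.
local collected = {}
local handler = makeSkippedEntityHandler(function (_, s)
   collected[#collected + 1] = s
end, entityCodepoints)
handler(nil, "psi")
handler(nil, "lang")
```

Storing code points rather than pre-encoded strings is only one possible choice; a table of ready-made UTF-8 strings would skip the encoding step at lookup time.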

But of course, we don't want to do this for any random XML document. Those might have their own entities, not the HTML ones... And some of the ideas mentioned in #2111 (dedicated XML inputters, possibly with other schema-based rules on space handling etc.) are perhaps sounder than ever...

Any opinions on the topic, before I start hacking like a madman? ;)

(EDIT: Fixed the SkippedEntity code example)

@Omikhleia Omikhleia added enhancement Software improvement or feature request inputters:xml labels Oct 20, 2024
@Omikhleia Omikhleia added this to Math Oct 20, 2024
@github-project-automation github-project-automation bot moved this to To do in Math Oct 20, 2024