raw_html removes content within self closing tags #132

Closed
navinpeiris opened this issue Aug 3, 2017 · 4 comments

@navinpeiris
Contributor

When raw_html is used on self-closing tags that contain text, that text is lost in the output. This is a problem when extracting XML content, for example from RSS items.

For example, calling raw_html after parsing the following XML:

<item>
  <link>www.example.com</link>
</item>

results in:

<item>
  <link/></link>
</item>

@navinpeiris
Contributor Author

Pull request submitted for this issue: #131

@mischov
Contributor

mischov commented Aug 3, 2017

In HTML5, void elements (<link> is one) cannot have closing tags, and consequently cannot have contents. See https://html.spec.whatwg.org/multipage/syntax.html#elements-2

I believe void elements should get closed automatically by HTML5 parsers, meaning that if they appear to have contents, those contents will be parsed as children of the void element's parent.

In your case, this means that the correct HTML5 parsing of

<item>
  <link>www.example.com</link>
</item>

would be

<item>
  <link /> www.example.com
</item>

As you say, this causes problems when trying to parse XML, but at the moment Floki only supports HTML.
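
The void-element behaviour described above can be illustrated with a minimal Python sketch built on the standard library's html.parser. This is not how Floki or mochiweb_html work internally, and it is far from a complete HTML5 tree builder; TreeBuilder and parse are hypothetical names, and the {tag, attrs, children} node shape is only loosely modelled on the kind of tree discussed in this thread.

```python
from html.parser import HTMLParser

# Common HTML5 void elements (assumed list, sufficient for this sketch).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TreeBuilder(HTMLParser):
    """Builds (tag, attrs, children) tuples, closing void elements at once."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [], [])
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = (tag, attrs, [])
        self.stack[-1][2].append(node)
        # A void element is never left open, so it cannot receive children.
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # a stray end tag for a void element is ignored
        # Close the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text is attached to whichever element is currently open.
        self.stack[-1][2].append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root[2]


tree = parse("<item><link>www.example.com</link></item>")
print(tree)
# [('item', [], [('link', [], []), 'www.example.com'])]
```

The text ends up as a sibling of link under item, which is the parse described above and the reason the URL is not recoverable as the contents of link.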

@mischov
Contributor

mischov commented Aug 3, 2017

That said, your fix suggests that the data is being incorrectly parsed (or at least, not parsed according to HTML5) by mochiweb_html, then just dropped by raw_html, so your fix is a good one in this particular case (it is the same approach I took in Meeseeks).
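
As a rough illustration of that idea (not the code from the pull request, which is not shown in this thread), the following Python sketch serializes a (tag, attrs, children) tree and keeps the children of a void element when the parser has put some there. The function name to_html and the tuple layout are assumptions made for this example, and escaping of text and attribute values is deliberately omitted.

```python
# Common HTML5 void elements (assumed list, sufficient for this sketch).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def to_html(node):
    """Serialize a (tag, attrs, children) tuple or a text string."""
    if isinstance(node, str):
        return node  # text node (no escaping in this sketch)
    tag, attrs, children = node
    attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
    # Only use the self-closing form when there is nothing to lose.
    if tag in VOID_ELEMENTS and not children:
        return f"<{tag}{attr_text}/>"
    inner = "".join(to_html(child) for child in children)
    return f"<{tag}{attr_text}>{inner}</{tag}>"


# A tree in which the parser attached text to a void element.
tree = ("item", [], [("link", [], ["www.example.com"])])
print(to_html(tree))
# <item><link>www.example.com</link></item>

print(to_html(("link", [("rel", "stylesheet")], [])))
# <link rel="stylesheet"/>
```

The design choice is to let the serializer reflect the tree it was given rather than enforce HTML5 validity, so content placed under a void element by the parser is written out instead of being dropped.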

@philss
Owner

philss commented Aug 4, 2017

Yes, I'm assuming that his fix is meant to represent HTML that was wrongly parsed. I think it's OK to support this scenario, even if the output is "invalid" HTML5, since it reflects the tree that was actually parsed.

Thank you @navinpeiris and @mischov! 😃

@philss closed this as completed Aug 4, 2017