Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Floki removes blank text nodes without option to avoid this #75

Open
Eiji7 opened this issue Dec 15, 2016 · 11 comments
Open

Floki removes blank text nodes without option to avoid this #75

Eiji7 opened this issue Dec 15, 2016 · 11 comments
Labels

Comments

@Eiji7
Copy link

Eiji7 commented Dec 15, 2016

Actual results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
[{"span", [], ["5"]}, {"span", [], ["="]}, {"span", [], ["5"]}]

Expected results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
# or
Floki.parse("<span>5</span> <span>=</span> <span>5</span>", remove_text_nodes_with_whitespaces: false)

[{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Note: this is really important in some cases. For example, please try parsing html generated from github markup (code samples).

@philss
Copy link
Owner

philss commented Dec 18, 2016

Hi @Eiji7.

This issue is a known issue inside Mochiweb HTML parser, which is the parser behind Floki.

The suggested workaround would replace white-spaces between tags with a &#32; or &#x20; before calling :mochiweb.parse/1.

I tried some expressions to replace and I came up with this:

"<span>5</span> <span>=</span> <span>5</span>"
|> String.replace(~r/>[ \n\r]+</, ">&#32;<")
|> Floki.parse

# => [{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Not sure if this is safe to do inside Floki.parse/2, specially because this would affect code inside script tags.

Maybe the option you are suggesting could come as true by default only to preserve the current behavior that is broken, but is already known. What do you think? Do you have suggestions?
Thanks!

@Eiji7
Copy link
Author

Eiji7 commented Dec 19, 2016

Edit:
@philss: Looks better, but it should be already implemented, so developers not need to care about replace.

@aphillipo
Copy link

@Eiji7 Surely it's going to add a lot of overhead if you don't need this though? Maybe it should be Floki.preserve_inter_tag_spaces and documented to say that it will copy the entire string to do this.

@Eiji7
Copy link
Author

Eiji7 commented Dec 29, 2016

@aphillipo: yup, this way or #37.
btw. If you can explain me how parsers should work, like:

best way is to create regex, you can find sample/demo patterns at ...

then I could think about create an XML/HTML parser and validator in Elixir. I can start work on it now, but I don't worked on similar project, so I can't provide others that it will be really fast. As listed in 3rd point at #37 issue we need think about a good algorithm to avoid decrease of efficiency

@aphillipo
Copy link

Building HTML parsers is very difficult - I want to use Servo's for this eventually...

https://github.com/servo/html5ever

@Eiji7
Copy link
Author

Eiji7 commented Dec 29, 2016

@philss and @aphillipo: Did you know :xmerl and :xmerl_scan?
For example see erlang docs.
Maybe it's what we are looking for that we can easily extend?

Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:4:4] [async-threads:10]

Interactive Elixir (1.3.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :xmerl_scan.string('<p><span>5</span> <span>=</span> <span>5</span></p>')  
{{:xmlElement, :p, :p, [], {:xmlNamespace, [], []}, [], 1, [],
  [{:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 1, [],
    [{:xmlText, [span: 1, p: 1], 1, [], '5', :text}], [], '/home/eiji',
    :undeclared}, {:xmlText, [p: 1], 2, [], ' ', :text},
   {:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 3, [],
    [{:xmlText, [span: 3, p: 1], 1, [], '=', :text}], [], :undefined,
    :undeclared}, {:xmlText, [p: 1], 4, [], ' ', :text},
   {:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 5, [],
    [{:xmlText, [span: 5, p: 1], 1, [], '5', :text}], [], :undefined,
    :undeclared}], [], '/home/eiji', :undeclared}, []}

What do you think about it?

@philss
Copy link
Owner

philss commented Dec 29, 2016

@Eiji7 :xmerl won't work for us. This is because xmerl was made to parse XML documents, and HTML is very less strict than XML. Normally HTML parsers are more forgiven than XML parsers, due the reason that users may write invalid HTML that should be presented in the browsers.

Here are some references:

One of the possibilities to fix this is to fix mochiweb's HTML parser.

As @aphillipo said, HTML parsers are very difficult to build. I created that issue (#37) without understand that. Now I'm more inclined to the idea of reuse an existing HTML parser. Servo's html5ever is my favorite choice for now.

I did not started working on this, but help trying out this "bridge" between Elixir and Rust would be very welcome! Here is a project that is very promising: https://github.com/hansihe/Rustler.

@Eiji7
Copy link
Author

Eiji7 commented Dec 30, 2016

@philss: Ok, I see your point. I see that your way ("bridge") is faster to implement. I don't know how exactly will work. Will it be as fast (with that "bridge") as native Elixir parser? It may be interesting for me in further future.
Anyway I'm interested in :xmerl and I think that I will look at its source files and I will probably create a Elixir version of it that will work for more that one document format (XML, HTML, XHTML and maybe also HAML).

@philss philss added the Bug label Jan 12, 2017
@philss
Copy link
Owner

philss commented Jan 12, 2017

@Eiji7 sorry for the delay.

I'm not sure how it will work. It does not seems to have a big performance gap when using, for example, an Erlang NIF written in C or Rust. The problems are more related to the security of code. If the C/Rust code crash, the entire Erlang VM can crash. This is one of the reasons I want to give Rust a try with html5ever (because Rust promises a more safe environment for your programs).

Cool! The idea to write a HTML parser entirely in Elixir is not dead yet! It is only very hard to do the "right way".

I will let this open as a "bug" and try to figure out how to "quick fix" this.

Thanks!

@dariodf
Copy link

dariodf commented Mar 4, 2018

I tried to use html5ever and i got too many errors when parsing.

So here's a (quite expensive) solution for this problem:

  defp replace_whitespaces(text, last_text \\ "")
  defp replace_whitespaces(text, last_text) when last_text == text, do: text
  defp replace_whitespaces(text, _) do
    text
    |> String.replace(~r/>((&#32;)*) ( *)</, ">\\g{1}&#32;\\g{3}<")
    |> replace_whitespaces(text)
  end

Then, you use this function before your find and text:

    body
    |> replace_whitespaces
    |> Floki.find(selector)
    |> Floki.text

@lud-wj
Copy link

lud-wj commented Feb 5, 2024

Hello,

I am having the same problem. Now it seems that the mochiweb parser is distributed in floki, so the problem could be fixed directly instead of relying on third party (which does not exist anymore AFAICT).

I have a workaround with html5ever that works for small payloads, I'm gonna test it with huge, dirty HTML chunks. But it could be nice to have an option to keep all the text nodes regardless of their contents, don't you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants