Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traverse_and_update is not applied to text nodes #338

Closed
ppatrzyk opened this issue Mar 6, 2021 · 4 comments
Closed

traverse_and_update is not applied to text nodes #338

ppatrzyk opened this issue Mar 6, 2021 · 4 comments
Labels

Comments

@ppatrzyk
Copy link

ppatrzyk commented Mar 6, 2021

Description

When using traverse_and_update, specified function is not applied to text nodes, these are passed as is:

def traverse_and_update(text, acc, _fun) when is_binary(text), do: {text, acc}

To Reproduce

My setup:

  • Ubuntu 20.04
  • Elixir 1.11.2
  • Erlang/OTP 23
  • floki 0.30.0
  • fast_html 2.0

config.exs:

config :floki,
  html_parser: Floki.HTMLParser.FastHtml
> html = """
<body>
    <div></div>
    <div>
         
    </div>
</body>
"""
"<body>\n    <div></div>\n    <div>\n         \n    </div>\n</body>\n"

> {:ok, doc} = Floki.parse_document(html)
{:ok,
 [
   {"html", [],
    [
      {"head", [], []},
      {"body", [],
       [
         "\n    ",
         {"div", [], []},
         "\n    ",
         {"div", [], ["\n         \n    "]},
         "\n\n"
       ]}
    ]}
 ]}

> Floki.traverse_and_update(doc, fn
  text when is_binary(text) ->
    case Regex.replace(~r/\n|[[:space:]]/, text, "") == "" do
      true -> nil
      false -> text
    end
  other -> other
end)
[
  {"html", [],
   [
     {"head", [], []},
     {"body", [],
      [
        "\n    ",
        {"div", [], []},
        "\n    ",
        {"div", [], ["\n         \n    "]},
        "\n\n"
      ]}
   ]}
]

Expected behavior

I'd like to remove all empty text nodes from given document, i.e. to get the following:

[
   {"html", [],
    [{"head", [], []}, {"body", [], [{"div", [], []}, {"div", [], []}]}]}
 ]

Is the current behavior a bug or is there some reason why it was designed like that? Alternatively, is there a better way to achieve what I want here? Thank you for any suggestions!

@ppatrzyk ppatrzyk added the Bug label Mar 6, 2021
@ppatrzyk
Copy link
Author

ppatrzyk commented Mar 6, 2021

maybe as one additional comment: I'm aware that others have exactly the opposite needs (#75) and that I can get ok results automatically when using Floki.HTMLParser.Mochiweb, I'm curious if it's possible to go with fast_html in my case

@philss
Copy link
Owner

philss commented Mar 8, 2021

@ppatrzyk thanks for opening the issue! 💜

I don't think this is a bug. I would say it was a decision from the time we added the feature, since you could update text nodes when capturing the HTML tags (there is an example below). Unfortunately this won't work for nodes that are in the root of the tree :/

The addition of this feature would be a breaking change, so I'm going to think about it.

For your case, you could do this:

Floki.traverse_and_update(doc, fn
  {tag, attrs, children} ->
    {tag, attrs,
     Enum.reject(children, fn child ->
       is_binary(child) && Regex.replace(~r/\n|[[:space:]]/, child, "") == ""
     end)}

  other ->
    other
end)

WDYT?

@ppatrzyk
Copy link
Author

ppatrzyk commented Mar 8, 2021

@philss thanks for your reply!

I'll adapt my code such that it works, thanks for your suggestion.

At minimum what would be helpful is mentioning this behavior in the documentation which currently reads:

This function returns a new tree structure that is the result of applying the given fun on all nodes

(emphasis mine)

@philss philss closed this as completed in fd88a28 Mar 8, 2021
@philss
Copy link
Owner

philss commented Mar 8, 2021

@ppatrzyk good point! I changed in fd88a28. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants