traverse_and_update is not applied to text nodes #338

ppatrzyk · 2021-03-06T10:15:32Z

Description

When using traverse_and_update, the specified function is not applied to text nodes; they are passed through as is:

floki/lib/floki/traversal.ex

Line 12 in 484565c

def traverse_and_update(text, acc, _fun) when is_binary(text), do: {text, acc}
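To illustrate what that clause implies, here is a simplified model of a traversal with the same shape. `ToyTraversal` is a hypothetical module written only for this example and is not Floki's actual source; only the binary clause mirrors the line quoted above (without the accumulator argument).

```elixir
defmodule ToyTraversal do
  # Hypothetical, simplified model of a tree traversal; not Floki's real code.

  # Text nodes are returned untouched: `fun` is never called for them.
  def traverse_and_update(text, _fun) when is_binary(text), do: text

  # Lists of nodes: map each node, dropping any that came back as nil.
  def traverse_and_update(nodes, fun) when is_list(nodes) do
    nodes
    |> Enum.map(&traverse_and_update(&1, fun))
    |> Enum.reject(&is_nil/1)
  end

  # Tag nodes: update the children first, then hand the tag to `fun`.
  def traverse_and_update({tag, attrs, children}, fun) do
    fun.({tag, attrs, traverse_and_update(children, fun)})
  end
end

tree = [{"div", [], ["  ", {"p", [], ["hi"]}]}]

result =
  ToyTraversal.traverse_and_update(tree, fn
    # This clause is never reached in the model, because text nodes
    # are short-circuited before `fun` is called.
    text when is_binary(text) -> String.upcase(text)
    {tag, attrs, children} -> {String.upcase(tag), attrs, children}
  end)

IO.inspect(result)
```

In this model the tag names come back uppercased while `"  "` and `"hi"` are unchanged, which matches the behavior described in this issue.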

To Reproduce

My setup:

Ubuntu 20.04
Elixir 1.11.2
Erlang/OTP 23
floki 0.30.0
fast_html 2.0

config.exs:

config :floki,
  html_parser: Floki.HTMLParser.FastHtml

> html = """
<body>
    <div></div>
    <div>
         
    </div>
</body>
"""
"<body>\n    <div></div>\n    <div>\n         \n    </div>\n</body>\n"

> {:ok, doc} = Floki.parse_document(html)
{:ok,
 [
   {"html", [],
    [
      {"head", [], []},
      {"body", [],
       [
         "\n    ",
         {"div", [], []},
         "\n    ",
         {"div", [], ["\n         \n    "]},
         "\n\n"
       ]}
    ]}
 ]}

> Floki.traverse_and_update(doc, fn
  text when is_binary(text) ->
    case Regex.replace(~r/\n|[[:space:]]/, text, "") == "" do
      true -> nil
      false -> text
    end
  other -> other
end)
[
  {"html", [],
   [
     {"head", [], []},
     {"body", [],
      [
        "\n    ",
        {"div", [], []},
        "\n    ",
        {"div", [], ["\n         \n    "]},
        "\n\n"
      ]}
   ]}
]

Expected behavior

I'd like to remove all empty (whitespace-only) text nodes from a given document, i.e. to get the following:

[
  {"html", [],
   [{"head", [], []}, {"body", [], [{"div", [], []}, {"div", [], []}]}]}
]

Is the current behavior a bug or is there some reason why it was designed like that? Alternatively, is there a better way to achieve what I want here? Thank you for any suggestions!


ppatrzyk · 2021-03-06T10:34:58Z

One additional comment: I'm aware that others have exactly the opposite needs (#75), and that I can get OK results automatically when using Floki.HTMLParser.Mochiweb. I'm curious whether it's possible to stay with fast_html in my case.

philss · 2021-03-08T18:13:36Z

@ppatrzyk thanks for opening the issue! 💜

I don't think this is a bug. I would say it was a design decision made when we added the feature, since you can update text nodes while handling the HTML tag that contains them (there is an example below). Unfortunately this won't work for text nodes that are at the root of the tree :/

The addition of this feature would be a breaking change, so I'm going to think about it.

For your case, you could do this:

Floki.traverse_and_update(doc, fn
  {tag, attrs, children} ->
    {tag, attrs,
     Enum.reject(children, fn child ->
       is_binary(child) && Regex.replace(~r/\n|[[:space:]]/, child, "") == ""
     end)}

  other ->
    other
end)

WDYT?
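The snippet above only sees text nodes that are children of a tag. For trees that may also have whitespace-only text nodes at the root, a minimal dependency-free sketch is to walk the parsed tree directly. It assumes the `{tag, attrs, children}` tuple shape shown earlier in this issue; `BlankText` and its functions are hypothetical names for illustration and are not part of Floki. The leading `"\n"` in `doc` is added here only to exercise the root-level case.

```elixir
defmodule BlankText do
  # Hypothetical helper for illustration; not part of Floki.

  # True for text nodes that are empty or contain only whitespace.
  def blank?(item), do: is_binary(item) and String.match?(item, ~r/\A[[:space:]]*\z/)

  # Lists of nodes (including the root list): drop blank text, then recurse.
  def strip(nodes) when is_list(nodes) do
    nodes
    |> Enum.reject(&blank?/1)
    |> Enum.map(&strip/1)
  end

  # Tag nodes: clean the children and keep tag and attributes as they are.
  def strip({tag, attrs, children}) when is_binary(tag) and is_list(children) do
    {tag, attrs, strip(children)}
  end

  # Non-blank text, comments, doctypes and so on pass through unchanged.
  def strip(other), do: other
end

doc = [
  "\n",
  {"html", [],
   [
     {"head", [], []},
     {"body", [],
      ["\n    ", {"div", [], []}, "\n    ", {"div", [], ["\n         \n    "]}, "\n\n"]}
   ]}
]

stripped = BlankText.strip(doc)
IO.inspect(stripped)
```

For this input the result has the shape given under "Expected behavior", with the root-level `"\n"` removed as well.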

ppatrzyk · 2021-03-08T18:49:56Z

@philss thanks for your reply!

I'll adapt my code such that it works, thanks for your suggestion.

At a minimum, it would be helpful to mention this behavior in the documentation, which currently reads:

This function returns a new tree structure that is the result of applying the given fun on all nodes

(emphasis mine)

philss · 2021-03-08T22:56:43Z

@ppatrzyk good point! I changed it in fd88a28. Thanks!

ppatrzyk added the Bug label Mar 6, 2021

philss closed this as completed in fd88a28 Mar 8, 2021
