-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make HTML escaping configurable #25
Comments
Yeah, this is an encoding issue, I think. We've had issues with this before. When HTML is written out as a file for parsing or rendering, it's encoded to have those entities appear in that The issue is that Hiccup has chosen to represent the text in an HTML tree in the HTML-encoded form.
As you can see, hiccup makes no change to the text it's given when it puts it out to the HTML. It's supposed to already be HTML-encoded. So we as hickory, in aiming to make a way to go from HTML to hiccup, have to do the HTML encoding ourselves when we convert something to hiccup form. But the entire time, we are working with the "logical" form of the text in memory:
I don't think that either approach is more correct than the other, but we did make different choices on this issue (I wasn't even aware of the difference at the time, until someone was having problems in the other direction earlier). I do like being able to work with the text in an HTML document without worrying about the HTML encoding issues until it is converted to another form, I think that's worth it. I'm not sure that an option as you suggest is the way to fix this, though. I think the real answer is just to be more careful about the encodings of the data as we work with it. |
If I pull Everything you've said makes good sense but I'm not sure it helps with this. Am I missing something? Of course my little fork does the job but it's not ideal in terms of maintenance etc. |
Yeah, I understand. You're not missing anything, I just don't think hickory is the place to insert a fix for the confusion caused by the combination of technologies you're using. Hickory parses HTML correctly, and it outputs HTML correctly. It also outputs hiccup forms correctly, in the sense that they can be passed without modification to hiccup and it will render to the correct HTML. I do notice that Reagent does not claim to accept hiccup as input, but rather a "hiccup-like" format. I'm sure they have their reasons for not accepting the original hiccup format, but I don't consider supporting their format a goal. What I would do is fix this in your project. Make your own protocol similar to HiccupRepresentable, say something like ReagentHiccuplikeRepresentable, and reimplement it on the jsoup nodes to give you the hiccup-like output you need in order to get those things speaking to each other correctly. |
Thanks for this, I fully understand and accept your decision that hickory isn't the place for this. However if you'll humour me I have one last (largely academic) argument to make.
The fact somebody needs to parse HTML guarantees they want to transform it, but does not guarantee that they want to turn it back into HTML again later. Given that a parser's existence is only justified by the need to manipulate, that's what it should enable: the canonical form of
The specific change required to solve this problem only brings good things: it's simple and short (~3/4 lines), is completely non-breaking, and (as above) completes core parsing functionality. Thoughts? |
I actually don't find hiccup a great format for non-trivial manipulation of the HTML data. In fact, I wrote hickory (specifically the data format) because of the difficulties I was having in that use case. I went into some of those in my comments on issue #24. My recommendation is to do as much manipulation as possible on the hickory format, and then convert that to hiccup at the end. Of course, hiccup has other strengths, and it's a widely used format in the clojure world. Maybe Reagent should consider using it, but as things stand, the fact that Hickory doesn't produce data that Reagent wants to consume is due to its being a bystander to tech choices that others have made. Now, if Reagent wanted to formalize and document its data format and press that forward as something useful outside of Reagent proper, it might make sense for Hickory to output it. However, I've found nothing like that. They link to the hiccup repo on their "Hiccup-like" link, and give some examples. I've done a number of libraries now that deal with common data formats that are formally specified to varying degrees, and it's too much work to pursue every minor embellishment that would help someone out with their data, even when there is a clear spec for the main data format. Clear inputs and outputs make these sorts of things much more manageable, and it keeps this code in a pretty reliable state. Additionally, I don't know Node, I don't know React, and I don't know Reagent, so I don't want to get involved in maintaining such code unless it is likely to be of very high value. I wouldn't be likely to do a very good job of maintaining code so far beyond my skill set ether. However, you are well-positioned to maintain code that does this, because you have a project with the combination of data inputs and outputs, and the technology expertise, that is relevant here. Unless it's apparent that there is a wide need for this HTML -> analysis -> hiccup-like combination, it seems like this is a combination of needs fairly unique to your project. So it seems like the best thing to do would be to take your code that does this massaging and have that exist in your own project as a utility. Though, if you are really planning to manipulate the structured HTML data a good amount, I'd again suggest working with the hickory version, and then taking |
I just want to make clear that I'm not arguing for Reagent support as I agree that that is beyond the scope of Hickory, and I'm not selfish enough to drag this debate out in pursuit of a minor embellishment that benefits a tiny number of users. The only reason I mentioned it all was for context; the case I'm making, while obviously motivated by Reagent, stands independently of the specifics of this use-case. I understand that the battle is lost, so to speak, and again I want to point out that this argument really is academic, but to be clear the point I am making is that the need to parse exists if and only if there is a need to do computation on it: if you don't need to do that, you don't need to parse it - end of story. Any parser that only partially parses it's input is not fully enabling this computation and, ergo, is not fully doing it's job (again a theoretical point, not scathing criticism.) If the hiccup format was no good for computation, but only rendering, then why include it in a parsing library? I mean whats the point of Anyhow - I get it. Ultimately I'm making the case for a kind of theoretical purity in what a parser "should be" and you're quite rightly sticking up for pragmatic concerns. Nevertheless thanks for entertaining the notion. 😄 |
👍 I have this problem too and @lorddoig I was also thinking to zip descending the three unescaping...I will check yous branch for your solution. BTW I was pretty surprised when I saw all the escaped text, but after reading this I guess it was not an easy choice from hickory's side. |
Edit Well, never mind, that obviously won’t work since one can’t tell beforehand which Edit the second I was using Specter to slice and dice Hiccup (from Hickory), so here’s a stupid, expensive way to unescape Hiccup:
|
Here's the function Hickory uses for HTML escaping: https://github.com/davidsantiago/quoin/blob/master/src/quoin/text.clj#L4 It just does &, <, >, and ". |
I would also appreciate a configuration option. I'm trying to put some syntax colour on Clojure forms, like this: (pr-str (map hickory/as-hiccup
(hickory/parse-fragment
(highlight-html edn-form)))) Parsing a form like this using (highlight-html "(println \"foo\")") emits "<div class=\"syntax\"><pre><span class=\"s-exp\">(</span><span class=\"core-fn\">println</span> <span class=\"string\">\"foo\"</span><span class=\"s-exp\">)</span></pre></div>" The 'problem' is that glow emits \"foo\" in the HTML - which is obviously the correct output. After parsing by hickory it becomes "([:div {:class \"syntax\"} [:pre {} [:span {:class \"s-exp\"} \"(\"] [:span {:class \"core-fn\"} \"println\"] \" \" [:span {:class \"string\"} \""foo"\"] [:span {:class \"s-exp\"} \")\"]]])" using the unescape trick from @fasiha make the output nicer but I lose the quotes on "foo" which means that I cannot properly represent the input form. Ideally I would like to only unescape the TextNode but it seems like this is ruled out. Anyway, I like the library and am adding this comment so that you have another data point. |
I came here looking for a resolution for the same problem. Specifically, as above, with regards to syntax highlighting. |
As I understand from this ticket, Hickory produces correct Hiccup, I guess this is a bug or missing feature in Reagent, hence I asked for a fix there: reagent-project/reagent#413 |
imho this has very little to do with reagent. |
I'm currently using
as-hiccup
to transform parsed HTML into a vector I can save and later render with Reagent. This works almost flawlessly except for the fact thatas-hiccup
escapes anyTextNode
s it comes across, meaning things like&
show up as literal strings when I pipe it down to the browser.Maybe a keyword argument is the way to go? Not sure how that works with protocols.
The text was updated successfully, but these errors were encountered: