Cannot handle div elements nested within <a> tags #4186

Closed
OneSir opened this issue Oct 30, 2024 · 2 comments

OneSir commented Oct 30, 2024

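Cheerio code:

```typescript
import * as cheerio from 'cheerio';
// domhandler ships as a cheerio dependency; its Element is the DOM node type.
import type { Element as DomElement } from 'domhandler';

// Shape of the extracted tree. The original snippet did not include this
// definition, so the fields are inferred from the returned object below.
interface Element {
  tag: string;
  attributes: Record<string, string>;
  children?: Element[];
}

function parseHtml(html: string): Element[] {
  const $ = cheerio.load(html);
  console.log($.html());

  // Recursively convert one node into { tag, attributes, children }.
  function extractElement($el: cheerio.Cheerio<DomElement>): Element {
    const tag = ($el.prop('tagName') ?? '').toLowerCase();
    const attributes = $el.attr() || {};

    const children: Element[] = [];
    $el.children().each((_, child) => {
      children.push(extractElement($(child)));
    });

    return {
      tag,
      attributes,
      children: children.length > 0 ? children : undefined,
    };
  }

  const rootElement = $('body').children().first();
  return [extractElement(rootElement)];
}
```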

I want the child elements of the `<a>` tag to be extracted hierarchically, i.e. nested under the `<a>` they belong to in the source HTML.

OneSir (Author) commented Oct 30, 2024

The function returns:
[ { "tag": "div", "attributes": { "class": "header__icons display-none-tablet" }, "children": [ { "tag": "a", "attributes": { "id": "HeaderBtnBox", "class": "Modern_wishlist_header header__btn seedmore_wishlist_header", "style": "display: inline-flex; justify-content: center; align-items: center; min-width: 40px;" } }, { "tag": "a", "attributes": { "class": "indexstyle__LinkBox-sc-18jtyu9-1 eyYnPk header__btn notranslate", "href": "/user/signIn?redirectUrl=/pages/wishlist_5e.buvk" } }, { "tag": "div", "attributes": { "direction": "row", "class": "indexstyle__LinkBtn-sc-18jtyu9-0 jMUhyS index-module_link_btn_05f09c6", "style": "flex-direction: row; direction: ltr;" }, "children": [ { "tag": "a", "attributes": { "class": "indexstyle__LinkBox-sc-18jtyu9-1 eyYnPk header__btn notranslate", "href": "/user/signIn?redirectUrl=/pages/wishlist_5e.buvk" } }, { "tag": "div", "attributes": { "class": "index-module_link_btn_content_3d4aa20" }, "children": [ { "tag": "a", "attributes": { "class": "indexstyle__LinkBox-sc-18jtyu9-1 eyYnPk header__btn notranslate", "href": "/user/signIn?redirectUrl=/pages/wishlist_5e.buvk" } }, { "tag": "div", "attributes": { "class": "index-module_Popconfirm_8b8fc27" }, "children": [ { "tag": "a", "attributes": { "class": "indexstyle__LinkBox-sc-18jtyu9-1 eyYnPk header__btn notranslate", "href": "/user/signIn?redirectUrl=/pages/wishlist_5e.buvk" }, "children": [ { "tag": "div", "attributes": { "class": "index-module_Arrow_06e31e6" }, "children": [ { "tag": "div", "attributes": { "class": "index-module_ArrowContent_3de6e9c" } } ] } ] }, { "tag": "div", "attributes": { "class": "index-module_loginContainer_44bd926" }, "children": [ { "tag": "a", "attributes": { "class": "indexstyle__LinkBox-sc-18jtyu9-1 eyYnPk header__btn notranslate", "href": "/user/signIn?redirectUrl=/pages/wishlist_5e.buvk" }, "children": [ { "tag": "div", "attributes": {}, "children": [ { "tag": "svg", "attributes": { "width": "77", "height": "66", "viewBox": "0 0 77 66", "fill": 
"none", "xmlns": "http://www.w3.org/2000/svg" }, "children": [ { "tag": "path", "attributes": { "d": "", "stroke": "black", "stroke-width": "3", "stroke-linecap": "round", "stroke-linejoin": "round" } } ] } ] }, { "tag": "div", "attributes": { "class": "index-module_loginText_e702488" } } ] }, { "tag": "div", "attributes": { "style": "width: 100%;" }, "children": [ { "tag": "a", "attributes": { "class": "indexstyle__LinkBox-sc-18jtyu9-1 eyYnPk header__btn notranslate", "href": "/user/signIn?redirectUrl=/pages/wishlist_5e.buvk" } }, { "tag": "a", "attributes": { "class": "index-module_loginBtn_3c121c9", "href": "/user/signIn", "style": "background: rgb(0, 0, 0); color: rgb(205, 242, 77);" } } ] } ] } ] } ] }, { "tag": "span", "attributes": { "style": "position: relative; display: inline-flex;" }, "children": [ { "tag": "svg", "attributes": { "xmlns": "http://www.w3.org/2000/svg", "width": "20", "height": "19", "viewBox": "0 0 20 19", "fill": "none" }, "children": [ { "tag": "path", "attributes": { "d": "", "fill": "white" } }, { "tag": "path", "attributes": { "d": "", "fill": "currentColor" } } ] } ] }, { "tag": "span", "attributes": { "class": "indexstyle__TitleText-sc-18jtyu9-2 cOZAzz body4" } } ] }, { "tag": "div", "attributes": { "class": "user__container" }, "children": [ { "tag": "a", "attributes": { "class": "icon-button header__icon-button", "href": "/user/signIn" }, "children": [ { "tag": "svg", "attributes": { "class": "icon icon-user", "width": "28", "height": "28", "viewBox": "0 0 28 28", "fill": "none", "xmlns": "http://www.w3.org/2000/svg" }, "children": [ { "tag": "path", "attributes": { "fill-rule": "evenodd", "clip-rule": "evenodd", "d": "M14 13C16.2091 13 18 11.2091 18 9C18 6.79086 16.2091 5 14 5C11.7909 5 10 6.79086 10 9C10 11.2091 11.7909 13 14 13ZM9 14C6.79086 14 5 15.7909 5 18V22C5 22.5523 5.44772 23 6 23H22C22.5523 23 23 22.5523 23 22V18C23 15.7909 21.2091 14 19 14H9Z", "fill": "currentColor" } } ] } ] } ] }, { "tag": "a", "attributes": { 
"id": "cart-icon-bubble", "class": "icon-button header__icon-button ", "href": "/cart" }, "children": [ { "tag": "div", "attributes": { "id": "cart-icon-bubble-wrapper" }, "children": [ { "tag": "svg", "attributes": { "class": "icon icon-cart", "width": "28", "height": "28", "viewBox": "0 0 28 28", "fill": "none", "xmlns": "http://www.w3.org/2000/svg" }, "children": [ { "tag": "path", "attributes": { "d": "M6 12C6 11.4477 6.44772 11 7 11H21C21.5523 11 22 11.4477 22 12V22C22 22.5523 21.5523 23 21 23H7C6.44772 23 6 22.5523 6 22V12Z", "fill": "currentColor" } }, { "tag": "path", "attributes": { "d": "M18 10C18 11.0677 18.0001 13.5 18.0001 13.5H10.0001L10 10C10 7.79086 11.7909 6 14 6C16.2091 6 18 7.79086 18 10Z", "stroke": "currentColor", "stroke-width": "2" } } ] } ] } ] } ] } ]

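A flat JSON dump like the one above is hard to read for nesting problems. As a minimal sketch (not part of the original report), the helper below prints such a tree as an indented outline. The names `ExtractedElement` and `outline` are hypothetical, and the sample data is a trimmed copy of the `cart-icon-bubble` branch from the output above with most attributes omitted.

```typescript
// Hypothetical type mirroring the objects in the output above.
interface ExtractedElement {
  tag: string;
  attributes: Record<string, string>;
  children?: ExtractedElement[];
}

// Render each node as "tag#id" (id only when present), indented by depth.
function outline(nodes: ExtractedElement[], depth: number = 0): string[] {
  const lines: string[] = [];
  for (const node of nodes) {
    const id = node.attributes.id ? `#${node.attributes.id}` : '';
    lines.push(`${'  '.repeat(depth)}${node.tag}${id}`);
    for (const line of outline(node.children ?? [], depth + 1)) {
      lines.push(line);
    }
  }
  return lines;
}

// Trimmed sample taken from the end of the reported output.
const sample: ExtractedElement[] = [
  {
    tag: 'a',
    attributes: { id: 'cart-icon-bubble', href: '/cart' },
    children: [
      {
        tag: 'div',
        attributes: { id: 'cart-icon-bubble-wrapper' },
        children: [
          {
            tag: 'svg',
            attributes: { class: 'icon icon-cart' },
            children: [
              { tag: 'path', attributes: { fill: 'currentColor' } },
              { tag: 'path', attributes: { stroke: 'currentColor' } },
            ],
          },
        ],
      },
    ],
  },
];

const outlineText = outline(sample).join('\n');
console.log(outlineText);
```

Running `outline` over the full result makes it easier to see where the same `<a>` is repeated at several depths.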
fb55 (Member) commented Dec 25, 2024

This is how HTML works. Try parsing with htmlparser2 instead (e.g., by passing the options `{ xml: { xmlMode: false } }`) to sidestep some of the HTML parsing specifics.
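The suggestion above can be sketched as follows. The markup here is a hypothetical minimal case, not the reporter's page: it assumes the trigger is an `<a>` nested inside another `<a>` (via a `<div>`), which the HTML parsing rules do not allow and which a spec-compliant parser therefore restructures. The helper name `countNestedAnchors` is made up for illustration.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical minimal input: an <a> containing a <div> that contains another <a>.
const html = '<a href="/outer"><div><a href="/inner">x</a></div></a>';

// Default mode: cheerio uses its spec-compliant HTML parser.
const specDoc = cheerio.load(html);

// Suggested mode: parse with htmlparser2, keeping HTML (non-XML) tag handling.
const lenientDoc = cheerio.load(html, { xml: { xmlMode: false } });

// Count <a> elements that have another <a> as an ancestor.
function countNestedAnchors($: cheerio.CheerioAPI): number {
  return $('a a').length;
}

const specCount = countNestedAnchors(specDoc);
const lenientCount = countNestedAnchors(lenientDoc);

console.log('spec-compliant parser:', specCount);
console.log('htmlparser2:', lenientCount);
console.log(lenientDoc.html());
```

With the default parser the inner link should no longer be a descendant of the outer one, while htmlparser2 keeps the nesting as written. Note that htmlparser2 does not synthesize `<html>` or `<body>` wrappers, so a selector such as `$('body').children().first()` would likely need to become `$.root().children().first()` when switching modes.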

fb55 closed this as not planned on Dec 25, 2024