Android Authority feed #614

Closed
PhilC813 opened this issue Dec 17, 2024 · 11 comments
@PhilC813
Contributor

PhilC813 commented Dec 17, 2024

Feed URL

https://www.androidauthority.com/feed/

Add any details, links, or screenshots about the article layout that's missing or wrong

In the following article, the names of the different sections, which are headers ("h2" elements), are stripped out in full content mode.

Article:
https://www.androidauthority.com/new-android-apps-658839/

[Screenshot: 2024_12_17_02 38 54]

Here's the stripped-out HTML, which is rather simple:
<h2 id="[number]">[Header]</h2>

I would assume that the same would occur for other articles in the feed.

Unless I'm mistaken, header elements shouldn't be stripped out regardless of the feed; they are usually relevant to the content.

Thank you!!

@PhilC813 PhilC813 changed the title from "Headers stripped out in Android Authority article" to "Elements stripped out in Android Authority article" on Dec 17, 2024
@PhilC813 PhilC813 changed the title from "Elements stripped out in Android Authority article" to "Android Authority feed" on Dec 17, 2024
@PhilC813
Contributor Author

PhilC813 commented Dec 17, 2024

I also noticed that the embedded YouTube videos are stripped out. Unlike the headers, they are not regular HTML elements: they only appear in JSON data embedded in the page. I understand if this can't be improved in the parser.

"TED Tumblewords" app section:
{"resource":"nc-embed-youtube","video":{"youtubeId":"1Z9fVc6v2aY"}}

"Carrion" app section:
{"resource":"nc-embed-youtube","video":{"youtubeId":"M6NOM2-UZdw"}}
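For illustration only, here is a rough Python sketch of how the two embed objects above could be located in a page's source and turned into watch links. It is not how Capy Reader or its parser handles them. The regular expression assumes the objects always look exactly like the two samples, and `find_youtube_ids` and `watch_url` are hypothetical helper names.

```python
import json
import re
from typing import List

# Assumption: every embed is a flat object shaped like the two samples above.
EMBED_PATTERN = re.compile(
    r'\{"resource":"nc-embed-youtube","video":\{"youtubeId":"[^"]+"\}\}'
)


def find_youtube_ids(page_source: str) -> List[str]:
    """Return the YouTube IDs of all matching embed objects, in document order."""
    ids = []
    for match in EMBED_PATTERN.finditer(page_source):
        embed = json.loads(match.group(0))  # parse the matched JSON object
        ids.append(embed["video"]["youtubeId"])
    return ids


def watch_url(youtube_id: str) -> str:
    """Build a standard YouTube watch URL for the given ID."""
    return f"https://www.youtube.com/watch?v={youtube_id}"


# The two objects quoted above, surrounded by unrelated text.
sample = (
    'before {"resource":"nc-embed-youtube","video":{"youtubeId":"1Z9fVc6v2aY"}} '
    'between {"resource":"nc-embed-youtube","video":{"youtubeId":"M6NOM2-UZdw"}} after'
)

for youtube_id in find_youtube_ids(sample):
    print(watch_url(youtube_id))
```

This only works if the JSON is actually present in the source that a non-browser client receives, which may not be the case.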

@jocmp jocmp moved this to On Deck in Capy Reader Dec 17, 2024
@jocmp jocmp self-assigned this Dec 17, 2024
@jocmp
Owner

jocmp commented Dec 17, 2024

Thanks for all the details. The missing headers should be simple. I'll see what I can do for that YouTube video JSON.

@jocmp jocmp moved this from On Deck to In Progress in Capy Reader Dec 18, 2024
@jocmp
Owner

jocmp commented Dec 18, 2024

I have a custom parser I'll test a little more before adding to the next release (jocmp/mercury-parser#27).

I haven't found a way to grab the YouTube videos - it looks like they're doing something weird with JavaScript for non-browser clients (like Capy). That said, headers are working. Stay tuned!

[Screenshots: Before | After]

@PhilC813
Contributor Author

Sounds promising! Thank you!

If you made a custom parser for this feed specifically, is it because you don't consider header elements to be relevant to the content for most feeds?

@jocmp
Owner

jocmp commented Dec 18, 2024

I think headers are important. I want to avoid breaking the core parser since I don't fully understand all the different parts yet. So adding a custom fix is easiest now.

In general, sites can misuse headers, which is why the previous maintainers built the parser that way. Luckily they left a note on why they built it that way.

Remove any headers that appear before all other p tags in the
document. This probably means that it was part of the title, a
subtitle or something else extraneous like a datestamp or byline,
all of which should be handled by other metadata handling.
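As a rough illustration of the rule in that note, here is a small Python sketch. It is one reading of the note, not the parser's actual code (the parser is not written in Python). It works on a raw HTML string with regular expressions, which is a simplification, and `strip_leading_headers` is a hypothetical name.

```python
import re

# Matches a complete h1-h6 element, e.g. <h2 id="1">Section</h2>.
HEADER_PATTERN = re.compile(r"<h([1-6])\b[^>]*>.*?</h\1>", re.IGNORECASE | re.DOTALL)
# Matches the start of a <p> tag, but not <pre> or <picture>.
FIRST_P_PATTERN = re.compile(r"<p\b", re.IGNORECASE)


def strip_leading_headers(html: str) -> str:
    """Remove headers that appear before the first <p> tag; keep later ones."""
    first_p = FIRST_P_PATTERN.search(html)
    # With no <p> at all, every header counts as "before all p tags".
    cutoff = first_p.start() if first_p else len(html)
    head, tail = html[:cutoff], html[cutoff:]
    return HEADER_PATTERN.sub("", head) + tail


article = (
    "<h1>Title</h1><h2>Byline</h2>"
    "<p>Intro</p>"
    '<h2 id="1">Section</h2><p>Body</p>'
)
print(strip_leading_headers(article))
```

Under this reading, an article whose body text is not wrapped in p tags would lose all of its headers, since the cutoff falls at the end of the document.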

@jocmp
Owner

jocmp commented Dec 22, 2024

Updated as of 2024.12.1085-dev

@jocmp jocmp closed this as completed Dec 23, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Capy Reader Dec 23, 2024
@PhilC813
Contributor Author

PhilC813 commented Jan 7, 2025

So far, articles seem to render properly with the update to the parser! Thank you so much! It's a feed that I only check occasionally, however, so I'll let you know if I notice other things down the line.


One small thing, albeit not important:
For polls like the one in this article, the poll and voting choices get rendered in Capy but, naturally, can't be interacted with, presumably because the JavaScript is missing.

[Screenshots: Website | Capy]

Here's the HTML for that section:
<div class="e_e"><div class="e_Oh e_Kp"><h3 class="e_Np">Which Android XR form factor are you most excited about?</h3><div class="e_Op">1343 votes</div><div class="e_Pp e_Qh"><button type="button" class="e_Qp"><div class="e_Rp">The headset! Give me that immersion!</div><div class="e_Mp">10<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:10%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">The glasses! I want that lightweight design I can use everywhere.</div><div class="e_Mp">46<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:46%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">Both.</div><div class="e_Mp">14<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:14%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">None. I&#x27;m already invested in another XR/VR/AR platform.</div><div class="e_Mp">3<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:3%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">I don&#x27;t care about XR, AR, or VR.</div><div class="e_Mp">28<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:28%"></div></div></button></div></div></div>

There's no special element or meaningful class name that would allow you to target these polls in the custom parser, but after comparing with some of their other articles that contain polls, it appears that the classes e_Oh and e_Kp are the ones used to define these polls.

Should these be stripped out? Or maybe it could also be useful to the reader to know that there's actually a poll there, I'm not sure.
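To make the idea concrete, here is a hedged Python sketch that finds a poll by the e_Oh and e_Kp classes and pulls out the question and the choices as plain text. It is an illustration, not the custom parser's code. It assumes that e_Np marks the question and e_Rp marks each choice, which comes from the single sample above and may not hold for other articles. `PollExtractor` and `extract_polls` are hypothetical names.

```python
from html.parser import HTMLParser

POLL_CLASSES = {"e_Oh", "e_Kp"}  # container classes noted above
QUESTION_CLASS = "e_Np"          # assumption: taken from the one sample
OPTION_CLASS = "e_Rp"            # assumption: taken from the one sample
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class PollExtractor(HTMLParser):
    """Collect the question and choices of each poll container."""

    def __init__(self):
        super().__init__()
        self.polls = []
        self._stack = []  # role of each open tag: "poll", "question", "option" or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        role = None
        if "poll" in self._stack:
            if QUESTION_CLASS in classes:
                role = "question"
            elif OPTION_CLASS in classes:
                role = "option"
                self.polls[-1]["options"].append("")
        elif POLL_CLASSES <= classes:
            role = "poll"
            self.polls.append({"question": "", "options": []})
        self._stack.append(role)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # The innermost labeled ancestor decides where the text goes.
        role = next((r for r in reversed(self._stack) if r), None)
        if role == "question":
            self.polls[-1]["question"] += data
        elif role == "option":
            self.polls[-1]["options"][-1] += data


def extract_polls(html):
    parser = PollExtractor()
    parser.feed(html)
    parser.close()
    return parser.polls


# Shortened version of the markup above, keeping two of the five choices.
sample_poll = (
    '<div class="e_e"><div class="e_Oh e_Kp">'
    '<h3 class="e_Np">Which Android XR form factor are you most excited about?</h3>'
    '<div class="e_Op">1343 votes</div><div class="e_Pp e_Qh">'
    '<button type="button" class="e_Qp"><div class="e_Rp">Both.</div>'
    '<div class="e_Mp">14<!-- -->%</div></button>'
    '<button type="button" class="e_Qp">'
    '<div class="e_Rp">I don&#x27;t care about XR, AR, or VR.</div>'
    '<div class="e_Mp">28<!-- -->%</div></button>'
    '</div></div></div><p>After the poll</p>'
)
print(extract_polls(sample_poll))
```

A result like this could be used either to drop the poll container entirely or to replace it with a short text note that a poll exists on the website.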

@jocmp
Owner

jocmp commented Jan 8, 2025

@PhilC813 good find. Let me see if they're using JS for these. It would be nice to allow this functionality. If nothing else, I'll remove it to avoid the jarring markup.

@PhilC813
Contributor Author

In this article:
https://www.androidauthority.com/best-android-apps-2024-3506812/

The titles of the apps on display are stripped out in Capy. Could this be improved?

There's also another example of a poll there, for when you need to confirm your results with #699!

@jocmp
Owner

jocmp commented Jan 14, 2025

Yep! I'll roll that in with the polls since they also use h3 headings. (jocmp/mercury-parser#41)

Here's a preview from my markup tester. "Google Gemini" and "Mozilla Thunderbird" were previously hidden.

@jocmp
Owner

jocmp commented Jan 18, 2025

@PhilC813 the heading and poll updates are available as of 2025.01.1096-dev!
