Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch between RSS content and Web page content (store both) #1999

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

danielyrovas
Copy link

@danielyrovas danielyrovas commented Aug 2, 2023

Please let me know what you think of this change.
It solves an issue that has irked me when using the interface, which is overwriting the original RSS content with the webpage content which sometimes is not as well formatted as the original RSS content or is empty/junk. It allows people to hit the download button just to see what the page looks like when it is downloaded, but still return to the RSS page to continue reading.

The feature works like this:
Whenever an entry is loaded, if there is downloaded content available, that will be displayed and the Download button becomes Show RSS Content.

Otherwise the page loads like normal, showing the Download button and the RSS page content.

If the Show RSS Content button is selected, the original RSS content is displayed and the button turns to the regular Download button.

If this Download button is selected, the page is processed and saved like normal (re-downloading it even if it has been saved before).

The /fetch-content API method now returns both content and web_content as JSON.

@@ -80,14 +80,15 @@ func ProcessFeedEntries(store *storage.Storage, feed *model.Feed, user *model.Us
logger.Error(`[Processor] Unable to crawl this entry: %q => %v`, entry.URL, scraperErr)
} else if content != "" {
// We replace the entry content only if the scraper doesn't return any error.
entry.Content = content
// TODO: document change
entry.WebContent = content
}
}

rewrite.Rewriter(url, entry, feed.RewriteRules)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic, I didn't look at how rewriter gets its data - I'll add a commit which fixes this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewriter now operates on web_content - as far as I can see rules do not operate on RSS content - only when processing original web page content.

@danielyrovas
Copy link
Author

I merged the commits and rebased onto main.

@fguillot
Copy link
Member

Looks like the unit tests are failing after your rebase

@danielyrovas
Copy link
Author

Yeah, I'll need to change some of those tests I think. The TestRewriteWithYoutubeLink checks the Content field instead of the WebContent field. I imagine it is similar for the other failing ones.

@danielyrovas danielyrovas force-pushed the dev-keep-rss-content-and-web-content branch 2 times, most recently from 77b36eb to d00bcfc Compare August 18, 2023 05:16
internal/api/entry.go Outdated Show resolved Hide resolved
@danielyrovas danielyrovas force-pushed the dev-keep-rss-content-and-web-content branch 3 times, most recently from f397706 to 34f4898 Compare September 1, 2023 04:53
@danielyrovas danielyrovas force-pushed the dev-keep-rss-content-and-web-content branch from 628dc2a to 29ee5ec Compare March 28, 2024 10:18
@danielyrovas
Copy link
Author

Hi @fguillot, I've updated my pull request. Do you mind taking a look at it?

Copy link
Member

@fguillot fguillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should the rewrite rules apply only to WebContent? They should apply to both Content and WebContent, especially for feeds that provide full HTML content by default.

@oldherl
Copy link

oldherl commented Aug 9, 2024

Any progress on this?

@wolfhechel
Copy link
Contributor

This to me seems to be changing the way the current v1/fetch-content endpoint works. The expectation is that the returned content key will be the web content, as such it would be breaking compatibility with current clients such as my own.

Surely, such a change should be prefixed to a bumped version - otherwise, what's the point of the v1 prefix at all?

Also I don't see why fetch-content needs to be changed at all since the purpose of it is to return the web content. An easy solution would be to switch web_content to be content as before and return entry_content for the original content.

Clients are generally okay with additional API information, not so much with removed or changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants