Possible to get HTML for winning content? #47

Open
kevzettler opened this issue Mar 21, 2017 · 4 comments

@kevzettler

Is there any way to get the HTML for the winning block content? I'd also like to get img, code, and pre elements, and preserve formatting with p and heading tags where possible.

@matt-peters
Collaborator

Unfortunately, not easily. You can get the lxml etree object for the start tag of each block and use it to reconstruct the HTML, but that takes some work. If you'd like to try implementing it, you can start by passing blocks=True to analyze when extracting the content. This returns a list of block objects for the extracted content, and block.features['block_start_element'] holds the lxml element for each one.

Something like:

# html is the page source as a string
blocks = content_extractor.analyze(html, blocks=True)
# lxml element for the start tag of each extracted block
start_elements = [block.features['block_start_element'] for block in blocks]
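
To go from start elements to HTML, one option is to serialize each element's subtree. This is only a rough sketch: elements_to_html is a hypothetical helper, the markup is made up, and the stdlib xml.etree.ElementTree stands in for lxml so the example runs on its own. It also assumes each start element's subtree roughly matches the block's content, which may not hold for every page.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in so this runs without lxml


def elements_to_html(elements, tostring=ET.tostring):
    """Serialize each element's subtree to an HTML string, without trailing text."""
    fragments = []
    for el in elements:
        # Serializers may append el.tail (the text after the closing tag),
        # so clear it temporarily and restore it afterwards.
        tail, el.tail = el.tail, None
        try:
            fragments.append(tostring(el, encoding="unicode", method="html"))
        finally:
            el.tail = tail
    return fragments


# Hypothetical markup standing in for the start elements of two blocks.
root = ET.fromstring("<div><p>Hello <b>world</b></p> trailing <pre>x = 1</pre></div>")
fragments = elements_to_html([root[0], root[1]])
print("\n".join(fragments))
```

With real start elements you would pass tostring=lxml.etree.tostring, which accepts the same encoding and method arguments. Tags such as img, code, and pre survive because the whole subtree is serialized, but nested start elements could produce overlapping fragments.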

@rferreiraperez

Would it be possible to keep at least the line breaks in the resulting text?

@MSusik

MSusik commented Jun 29, 2018

@rferreiraperez A workaround has been shown in #22

EDIT:

In the current version, you can get blocks with: dragnet.extract_content_and_comments(site, as_blocks=True)
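
For keeping line breaks, one possibility is to join the text of the returned blocks yourself. The sketch below assumes each block exposes a text attribute that may be bytes or str; blocks_to_text is a hypothetical helper, and the demo objects are stand-ins so it runs without dragnet installed.

```python
from types import SimpleNamespace


def blocks_to_text(blocks, separator="\n\n", encoding="utf-8"):
    """Join the text of each block, keeping one separator between blocks."""
    pieces = []
    for block in blocks:
        text = block.text  # assumed attribute; may be bytes or str
        if isinstance(text, bytes):
            text = text.decode(encoding, errors="replace")
        text = text.strip()
        if text:  # skip blocks that are empty after stripping
            pieces.append(text)
    return separator.join(pieces)


# Stand-in objects in place of real dragnet blocks.
demo_blocks = [
    SimpleNamespace(text=b"First paragraph."),
    SimpleNamespace(text="  Second paragraph. "),
    SimpleNamespace(text=b""),
]
print(blocks_to_text(demo_blocks))
```

In real use, demo_blocks would be replaced by the list returned from the as_blocks=True call above.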

@lukaspistelak

blocks = content_extractor.analyze(html, blocks=True)
start_elements = [block.features['block_start_element'] for block in blocks]

AttributeError: 'Extractor' object has no attribute 'analyze'
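
A possible way around this is to check which method the extractor object actually has before calling it. This is a sketch only: get_blocks is a hypothetical helper, the extract(html, as_blocks=True) method name is an assumption based on the as_blocks=True call mentioned earlier in this thread, and the two classes are stand-ins so the example runs without dragnet.

```python
def get_blocks(extractor, html):
    """Return block objects from either the old or the newer extractor API."""
    if hasattr(extractor, "analyze"):
        # Older releases: the call shown earlier in this thread.
        return extractor.analyze(html, blocks=True)
    if hasattr(extractor, "extract"):
        # Assumed newer method name; check it against your installed version.
        return extractor.extract(html, as_blocks=True)
    raise AttributeError("extractor exposes neither analyze() nor extract()")


class OldStyleExtractor:
    """Stand-in for an extractor that still has analyze()."""

    def analyze(self, html, blocks=False):
        return ["old-api-block"] if blocks else "old-api-text"


class NewStyleExtractor:
    """Stand-in for an extractor that only has extract()."""

    def extract(self, html, as_blocks=False):
        return ["new-api-block"] if as_blocks else "new-api-text"


print(get_blocks(OldStyleExtractor(), "<html></html>"))
print(get_blocks(NewStyleExtractor(), "<html></html>"))
```

Whether block.features['block_start_element'] is still populated in newer versions is not confirmed here, so check the returned blocks before relying on it.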


5 participants