potential improvements / new features to the extraction model? #86

bdewilde · 2019-04-02T16:01:47Z

I was doing a quick lit review to see if/how the state-of-the-art in web content extraction had changed over the past few years, and came upon a conference paper from last September, "Learning Web Content Extraction with DOM Features", that seems interesting, relevant, and performant. There's also code: see learnhtml. Is there any interest in implementing its feature set within dragnet, and evaluating model performance with such features? This could be related to updates proposed in Issue #85.

matt-peters · 2019-04-03T16:48:24Z

Anything that improves the performance is very welcome!

acertain · 2022-01-05T19:31:00Z

Another new package: https://github.com/adbar/trafilatura (see also https://github.com/scrapinghub/article-extraction-benchmark)
