Look at replacing most of crawler with an external crawling package #5

justinccdev · 2018-02-14T17:28:34Z

This is a big one, but it's possible that most of this crawler should be replaced with Apache Nutch or similar. I originally hacked this out as a proof-of-concept but, as usual, it grew a bit from there. However, we are now meeting scalability issues (parallel crawling, possibly on multiple machines, crawling to a large database, etc.), so we need to take a serious look at a well-established alternative like Nutch.

Some questions

Is Nutch suitable? If so, 1.x or 2.x?

justinccdev · 2018-02-14T17:34:28Z

If the crawling code is replaced, ideally I would like to keep it around for a while, maybe within a plugin infrastructure. However, if this is too messy (which I suspect is likely), it would be cleaner just to replace it and remove all the old code.

Also, I expect there will need to be some kind of 'shell' around Nutch to present a user-friendly frontend. Python is really still my favourite language for this rather than doing everything in Java, but that might be another decision to make.

justinccdev · 2018-02-14T18:00:30Z

Also look at existing work by Federico on this at https://github.com/BioSchemas/bioschemas-nutch-indexer

justinccdev · 2018-02-15T16:57:17Z

Also consider http://stormcrawler.net/

justinccdev · 2018-02-15T16:59:49Z

From http://digitalpebble.blogspot.co.uk/2017/01/the-battle-of-crawlers-apache-nutch-vs.html, it looks like if we're going to use Nutch it should be 1.x, not 2.x

justinccdev · 2018-02-15T17:09:01Z

Also https://scrapy.org/, written in Python rather than Java

justinccdev · 2018-02-15T17:35:16Z

Perhaps also http://www.norconex.com/collectors/collector-http/

justinccdev · 2018-02-15T18:08:31Z

Rather than keep spamming this page, I've started writing the evaluation at https://github.com/justinccdev/bsbang-crawler/wiki/Transition-to-an-established-crawler-package, but comments can continue here.

justinccdev · 2018-02-19T17:33:23Z

Having now sampled and read about various crawler projects, I think Scrapy/Frontera may be the way to go (see the wiki page for more details). I will very soon start a new GitHub repository to explore re-implementing the crawler in the Scrapy/Frontera infrastructure.

XiangpengHao · 2018-02-25T19:51:30Z

I think Scrapy is a good choice: it has a plugin, scrapy-splash, which addresses #7, and it is far more popular than its competitors :)
I'll investigate Scrapy further over the next few days.

justinccdev changed the title ~~Look at replacing most of crawler with Apache Nutch~~ Look at replacing most of crawler with an external crawling package Feb 16, 2018

justinccdev mentioned this issue Feb 19, 2018

Provide a way to download the crawl #1

Open
