nutch-elasticsearch

A prototype system to integrate nutch 2.2 with elasticsearch 1.1 and hbase 0.90.

Installation

Fire up vagrant:
```
 vagrant up
 vagrant ssh
 cd /vagrant
```
Download ant, nutch and hbase:
```
 bin/wget-deps.bash
```
Check elasticsearch is running:
```
 curl http://localhost:9200
```
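Elasticsearch can take a few seconds to come up after the VM boots, so a single `curl` may fail even when nothing is wrong. A minimal sketch of a retry loop follows; the `wait_for_es` name, the `ES_URL` variable and the retry defaults are illustrative assumptions, not part of this repo.

```shell
# Sketch: poll elasticsearch until it answers, or give up.
# ES_URL, the attempt count and the delay are assumed defaults.
ES_URL="${ES_URL:-http://localhost:9200}"

wait_for_es() {
    tries="${1:-30}"   # number of attempts
    delay="${2:-2}"    # seconds to wait between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: non-zero exit on HTTP errors, -o: discard the body
        if curl -s -f -o /dev/null "$ES_URL"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "elasticsearch did not respond at $ES_URL" >&2
    return 1
}
```

Call it as `wait_for_es` (or `wait_for_es 10 1` for ten attempts one second apart) before running the steps below.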

If elasticsearch is not running, start it, then create the nutch index:

 curl -XPUT http://localhost:9200/nutch/
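Re-running the `PUT` against an index that already exists returns an error from elasticsearch, which is noisy if you provision the box more than once. The sketch below checks for the index first with a `HEAD` request; `ensure_index` and `ES_URL` are hypothetical names used only for illustration.

```shell
# Sketch: create an index only if it is not already there.
# ES_URL is an assumed default matching the URL used above.
ES_URL="${ES_URL:-http://localhost:9200}"

ensure_index() {
    index="$1"
    # -I sends a HEAD request; elasticsearch answers 200 if the index
    # exists and 404 if it does not. -w prints just the status code.
    status=$(curl -s -o /dev/null -I -w '%{http_code}' "$ES_URL/$index")
    if [ "$status" = "200" ]; then
        echo "index '$index' already exists"
    else
        curl -s -XPUT "$ES_URL/$index/"
    fi
}
```

Usage: `ensure_index nutch`.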

Run build-nutch.bash to build nutch with ant/ivy and install the config file:
```
 /vagrant/build-nutch.bash
```
Start hbase:
```
 /opt/hbase-0.90.4/bin/start-hbase.sh
```
(Optional) Install BigDesk:
1. Download: https://github.com/lukas-vlcek/bigdesk/tarball/master
2. Extract BigDesk into /var/www/html/bigdesk
3. Visit the app: http://localhost:8080/bigdesk/

Run the nutch crawler:

 cd /opt/apache-nutch-2.2.1/runtime/local
 /vagrant/bin/index-url.bash /vagrant/conf/urls.txt

Test that the URL was stored in the crawldb:

 bin/nutch readdb -url `cat urls/urls.txt`

Index into elasticsearch:

 bin/nutch elasticindex elasticsearch -all
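To confirm that the index step actually wrote something, you can ask elasticsearch how many documents the index holds. This is a rough sketch that scrapes the number out of the `_count` response with `sed`; the `index_doc_count` helper and `ES_URL` variable are assumptions made for the example.

```shell
# Sketch: print the number of documents in an index.
# ES_URL is an assumed default matching the URL used above.
ES_URL="${ES_URL:-http://localhost:9200}"

index_doc_count() {
    # _count answers with JSON such as {"count":3,"_shards":{...}};
    # sed keeps only the digits after "count".
    curl -s "$ES_URL/$1/_count" |
        sed -n 's/.*"count" *: *\([0-9][0-9]*\).*/\1/p'
}
```

Usage: `index_doc_count nutch`. It prints nothing if the response does not contain a count, for example when the index does not exist.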

Helpful Information

The crawldb is stored in hbase, in the 'webpage' table.

nutch commands

A minimal crawl cycle:

cd runtime/local
echo "http://www.kusiri.com" > urls/urls.txt
bin/nutch inject urls/
bin/nutch generate -topN 1
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

bin/nutch elasticindex elasticsearch -all
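When the steps above are pasted into a terminal one by one, a failed fetch or parse is easy to miss. The sketch below chains the same commands so the cycle stops at the first failure. The `crawl_once` function and the `NUTCH` variable are illustrative assumptions; the nutch sub-commands are exactly the ones listed above.

```shell
# Sketch: run one crawl cycle, stopping at the first step that fails.
# NUTCH defaults to bin/nutch, so run this from runtime/local.
NUTCH="${NUTCH:-bin/nutch}"

crawl_once() {
    top_n="${1:-1}"   # how many URLs to generate, as in -topN above
    "$NUTCH" inject urls/ &&
    "$NUTCH" generate -topN "$top_n" &&
    "$NUTCH" fetch -all &&
    "$NUTCH" parse -all &&
    "$NUTCH" updatedb &&
    "$NUTCH" elasticindex elasticsearch -all
}
```

Usage: `crawl_once` for the single-URL case, or `crawl_once 50` to generate up to 50 URLs in the cycle.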

elasticsearch sense commands

GET /nutch/_search
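If you do not have Sense installed, the same request can be sent with curl. A small sketch, assuming elasticsearch is on localhost:9200 as elsewhere in this README; `search_nutch` and `ES_URL` are made-up names for the example.

```shell
# Sketch: curl equivalent of the Sense request above.
# ES_URL is an assumed default matching the URL used above.
ES_URL="${ES_URL:-http://localhost:9200}"

search_nutch() {
    if [ -n "$1" ]; then
        # -G turns the url-encoded pairs into a query string on a GET request;
        # q= is elasticsearch's query-string search parameter.
        curl -s -G "$ES_URL/nutch/_search" \
            --data-urlencode "q=$1" \
            --data-urlencode "pretty=true"
    else
        # No search term: return whatever the index holds.
        curl -s "$ES_URL/nutch/_search?pretty=true"
    fi
}
```

Usage: `search_nutch` lists documents, `search_nutch kusiri` searches for a term.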

hbase commands

Open a shell:

~/hbase/bin/hbase shell

Get help (from inside the shell):

hbase(main):001:0> help

List all tables:

hbase(main):001:0> list

Delete (i.e. disable then drop) the 'webpage' table:

hbase(main):002:0> disable 'webpage'
hbase(main):003:0> drop 'webpage'

Leave the shell:

hbase(main):002:0> exit
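To reset the crawl without opening an interactive shell, the same commands can be piped into `hbase shell`. This is a sketch only: it assumes the shell reads commands from standard input, and the `drop_webpage_table` function and `HBASE` variable are hypothetical. Point `HBASE` at wherever your hbase script lives.

```shell
# Sketch: drop the 'webpage' table non-interactively.
# HBASE defaults to the path used above; override it if yours differs.
HBASE="${HBASE:-$HOME/hbase/bin/hbase}"

drop_webpage_table() {
    # One shell command per line: disable, then drop, then leave the shell.
    printf "%s\n" "disable 'webpage'" "drop 'webpage'" "exit" |
        "$HBASE" shell
}
```

Usage: `drop_webpage_table`. This deletes the whole crawldb, so nutch starts from scratch on the next inject.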

elasticsearch commands

Create an index:

curl -XPUT 'http://localhost:9200/twitter/'

Node stats:

curl -XGET 'http://localhost:9200/_nodes/stats'

Troubleshooting

ClusterBlockException

[vagrant@localhost local]$ bin/nutch elasticindex elasticsearch -all
Exception in thread "elasticsearch[Caiera][generic][T#2]" org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:138)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:128)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:197)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:65)
    at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:143)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:117)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Check the following:

1. Your elasticsearch configuration is correct.
2. Your firewall is disabled:

  sudo service iptables stop
  sudo chkconfig iptables off
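The exception reports that the cluster has no master and has not recovered its state, so it can also help to look at the cluster health before re-running the indexer. A minimal sketch using the `_cluster/health` endpoint; `cluster_status` and `ES_URL` are names invented for this example.

```shell
# Sketch: print the cluster status (green, yellow or red).
# ES_URL is an assumed default matching the URL used above.
ES_URL="${ES_URL:-http://localhost:9200}"

cluster_status() {
    # The health response is JSON containing "status":"<colour>";
    # sed keeps only the colour.
    curl -s "$ES_URL/_cluster/health" |
        sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}
```

Usage: `cluster_status`. If it prints nothing, elasticsearch did not return a status at all, which points at the node itself rather than at nutch.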

References

NutchTutorial
Nutch2Tutorial
Nutch 2 and ElasticSearch - helpful blog post
Integrating Nutch 1.7 with ElasticSearch
NUTCH-1745 - Upgrade to ElasticSearch 1.1.0
1.2. Quick Start - hbase user manual
Hbase/Shell - from the Hadoop wiki
