nutch-aws

Overview

This project intends to document the pitfalls and tricks of running Nutch on an AWS EMR Hadoop cluster.

Getting Started

Get an Amazon Linux AMI instance running on EC2

The first step is to get one machine set up with the basic tools and configuration. I find that the simplest thing to do is to launch a t1.micro EC2 instance with the Amazon Linux AMI: it comes with most of the tools I need (ami-cli among them), it is very easy to replicate, and it is very inexpensive. I won't cover this step here as it is well documented on the web. (Hint: the Amazon Linux AMI is the first choice in the Quick Start tab of the web-based "Classic Wizard" EC2 launcher.) Make sure you have the security rights to ssh to this machine.
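
Once the instance is up, connecting to it looks like this (the key file name and public DNS below are placeholders for your own values; Amazon Linux uses the ec2-user account):

	chmod 400 my-key-pair.pem       # placeholder key file name
	ssh -i my-key-pair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com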

Configuration

  1. ssh to the Amazon Linux AMI instance and create a working folder, e.g., ~/nutch-aws; we will refer to it as NUTCH_AWS_HOME from now on (example commands for steps 1-3 and 8 are sketched after this list).

  2. scp your key-pair (.pem) file to this instance under NUTCH_AWS_HOME

  3. ssh back to the Amazon Linux AMI instance

  4. Install ant with yum

     sudo yum install ant -y
    
  5. Get the Makefile* from GitHub into NUTCH_AWS_HOME

     wget https://raw.github.com/eleflow/nutch-aws/master/Makefile
    
  6. Fill in the blanks in the Makefile

     ACCESS_KEY_ID =                        ## YOUR AWS ACCESS KEY ID
     SECRET_ACCESS_KEY =                    ## YOUR AWS SECRET ACCESS KEY
     AWS_REGION = us-east-1                 ## CHANGE IT IF YOU WANT
     EC2_KEY_NAME =                         ## YOUR EC2 KEY PAIR NAME
     KEYPATH = ${HOME}/${EC2_KEY_NAME}.pem  ## YOUR EC2 KEY PAIR (.pem) FILE (IF IT'S DIFFERENT FROM ${HOME}/${EC2_KEY_NAME}.pem)
     S3_BUCKET =                            ## THE S3 BUCKET WHERE FILES WILL BE READ FROM AND WRITTEN TO
     CLUSTERSIZE = 3                        ## NUMBER OF MACHINES IN THE CLUSTER
     DEPTH = 3                              ## HOW MANY LINK HOPS THE CRAWLER WILL GO
     TOPN = 5                               ## HOW MANY TOP-SCORING LINKS WILL BE FOLLOWED PER ROUND
     MASTER_INSTANCE_TYPE = m1.small
     SLAVE_INSTANCE_TYPE = m1.small
    
  7. Check the configuration:

     make s3.list
    

    If the configuration is correct, this should list the S3 buckets associated with your account.

  8. Create a NUTCH_AWS_HOME/urls/seed.txt file with the URLs that will be the starting point for the crawler (see the example after this list).
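
For reference, steps 1-3 and 8 boil down to commands like these (the key file name, hostname, and seed URL are placeholders):

	# step 1, on the instance: create the working folder (NUTCH_AWS_HOME)
	mkdir -p ~/nutch-aws
	# step 2, from your local machine: copy the key pair to the instance
	scp -i my-key-pair.pem my-key-pair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/nutch-aws/
	# step 8, back on the instance: create the seed file with your starting URLs
	mkdir -p ~/nutch-aws/urls
	echo "http://example.com/" > ~/nutch-aws/urls/seed.txt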

Running

Copying the Nutch job jar and seed files to S3

	make bootstrap

This make target will do the following (a rough shell equivalent is sketched after the list):

  1. download the Nutch 1.6 source code
  2. build the Nutch 1.6 MapReduce job jar
  3. copy the Nutch 1.6 MapReduce job jar to s3://S3_BUCKET/lib
  4. copy the contents of the NUTCH_AWS_HOME/urls folder to s3://S3_BUCKET/url
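
For orientation, the work done by this target corresponds roughly to the following commands. This is a minimal sketch, assuming s3cmd is available for the uploads; the actual Makefile may use different tooling and paths:

	# download and unpack the Nutch 1.6 source
	wget http://archive.apache.org/dist/nutch/1.6/apache-nutch-1.6-src.tar.gz
	tar xzf apache-nutch-1.6-src.tar.gz
	# build the MapReduce job jar with ant
	cd apache-nutch-1.6 && ant job && cd ..
	# upload the job jar and the seed urls (S3_BUCKET is your bucket name)
	s3cmd put apache-nutch-1.6/build/apache-nutch-1.6.job s3://S3_BUCKET/lib/
	s3cmd put urls/seed.txt s3://S3_BUCKET/url/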

Launching a cluster

	make create

This make target will do the following:

  1. start an EMR cluster and run these MapReduce jobs:
    1. run the Nutch crawl job
    2. run the Nutch mergesegs job
    3. copy the crawldb, linkdb, and merged segments folders from hdfs://users/hadoop/crawl to s3://S3_BUCKET/crawl
  2. copy the logs to s3://S3_BUCKET/logs

If everything went well, the ./jobflowid file should contain a job flow id (e.g., j-IR4OQTH2HE7Z).
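
You can use that id to check on the cluster in the AWS console, or, if you happen to have Amazon's elastic-mapreduce command line client installed (an assumption, it is not required by the Makefile), with something like:

	# placeholder id taken from ./jobflowid
	elastic-mapreduce --describe --jobflow j-IR4OQTH2HE7Z
	elastic-mapreduce --list --active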

Note: the cluster is launched with "keep_job_flow_alive_when_no_steps" set to false, which means it will be destroyed after the steps are completed.

Checking the master node

	make ssh

This will ssh into the master node, giving you access to the hadoop command line tool and the logs at /mnt/var/log/hadoop.
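
Once on the master node, a few commands are handy (the HDFS path is an assumption based on the crawl output location mentioned above):

	hadoop job -list                     # running MapReduce jobs
	hadoop fs -ls /users/hadoop/crawl    # crawl output in HDFS (assumed path)
	ls /mnt/var/log/hadoop               # Hadoop logs on the master node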

Destroying

	make destroy

This will kill any job that the cluster may be running and terminate the cluster.

[*] Yes, it's a Makefile. I based it on Karan's, and while it may not be the best tool for the job, it was an easy way to get things rolling quickly.