Big Data Analysis and Management Course Project
Authors: floodfill, ConanChou
Identify pages that have infoboxes. You will scan the Wikipedia pages and generate a CSV file named infobox.csv, with one line in the following format for each page that contains an infobox:
page_id, infobox_text
A page that does not have an infobox has no line in the CSV file. The infobox_text should contain all text of the infobox, including the template name, attribute names, and values. Note that the id of a Wikipedia page is its title.
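To illustrate what "all text for the infobox" means, here is a minimal sketch in plain Java that pulls the first infobox template out of a page's wikitext by matching nested double braces. It is not the InfoboxGetter class from this project; the class and method names are hypothetical, and it assumes an infobox starts with {{Infobox (case-insensitive) and that braces are balanced.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of this repository.
final class InfoboxExtractorSketch {

    // Case-insensitive indexOf that keeps offsets valid for the original string.
    static int indexOfIgnoreCase(String text, String needle) {
        for (int i = 0; i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the first {{Infobox ...}} template, including nested templates,
    // or an empty Optional if the page has no (balanced) infobox.
    static Optional<String> extractInfobox(String wikitext) {
        int start = indexOfIgnoreCase(wikitext, "{{infobox");
        if (start < 0) {
            return Optional.empty();
        }
        int depth = 0;
        int i = start;
        while (i < wikitext.length()) {
            if (wikitext.startsWith("{{", i)) {
                depth++;          // entering a (possibly nested) template
                i += 2;
            } else if (wikitext.startsWith("}}", i)) {
                depth--;          // leaving a template
                i += 2;
                if (depth == 0) {
                    return Optional.of(wikitext.substring(start, i));
                }
            } else {
                i++;
            }
        }
        return Optional.empty();  // braces never closed: treat as no infobox
    }

    public static void main(String[] args) {
        String page = "Intro\n{{Infobox person\n| name = Ada\n"
                + "| born = {{birth date|1815|12|10}}\n}}\nBody text";
        System.out.println(extractInfobox(page).orElse("(no infobox)"));
    }
}
```

A page for which extractInfobox returns an empty Optional would simply produce no line in infobox.csv.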
Create a new job flow in Amazon Elastic Map/Reduce with the following configuration:
- Hadoop version: 1.0.3
- Custom jar file:
s3n://diaosi-mapreduce/pro1.jar
- Jar arguments:
s3n://diaosi-mapreduce/raw_data s3n://<your-bucket-name>/<your-output-folder>
- Create a new Java Project in Eclipse
- Import the src folder to the project
- Add the dependency libraries, which you can find in the libs folder, to the build path
- Export a runnable jar with com.github.diaosi.BDAM.mapreduce.InfoboxGetter as the class to be launched
- Upload the jar you just generated to Amazon S3
- (Optional; we have already done this) Extract the Bzip2-compressed Wikipedia dumps to raw XML files and upload them to Amazon S3, or use s3n://diaosi-mapreduce/raw_data as the input
- Go to Amazon Elastic Map/Reduce, create a new job flow, and run the jar with the input path as the first parameter and the output path as the second
- Keep your fingers crossed while the job flow is running, until the results are generated
- Switch to your S3 bucket and find the results in the output folder you specified earlier
- They should be in CSV format, which can be opened by any spreadsheet software
- Note that Microsoft Excel may not open UTF-8 encoded CSV files correctly
- Also note that two different line separators are used in the generated CSV files: \r\n is the actual line (record) separator, and \n is used inside the infobox text
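Because of the two separators, reading the output line by line with a standard line reader would cut infoboxes apart at every \n. A minimal sketch of splitting the file content into records on \r\n only is shown below. The class and method names are hypothetical, and how page_id and infobox_text are quoted or escaped within a record is not covered here, so the sketch returns each record unsplit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of this repository.
final class CsvRecordSplitterSketch {

    // Splits the content on "\r\n" only, so a bare "\n" inside the
    // infobox text stays within its record. Empty records are skipped.
    static List<String> splitRecords(String content) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (from < content.length()) {
            int end = content.indexOf("\r\n", from);
            if (end < 0) {
                end = content.length();   // last record without a trailing separator
            }
            if (end > from) {
                records.add(content.substring(from, end));
            }
            from = end + 2;               // skip over the "\r\n"
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "A,{{Infobox x\n| k = v\n}}\r\nB,{{Infobox y}}\r\n";
        for (String record : splitRecords(content)) {
            System.out.println("record of length " + record.length());
        }
    }
}
```

When loading a real result file, decode it explicitly as UTF-8 before splitting, for example with new String(Files.readAllBytes(path), StandardCharsets.UTF_8).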
The Phase II code is under src/com/github/diaosi/BDAM/mapreduce (Hadoop code), src/com/github/diaosi/BDAM/utils (single-node code), and visualization (visualization code).