JCascalog Example with Parquet Data Format.
Make sure that the native lzop is installed on your hadoop cluster, for more details, please see https://github.com/twitter/hadoop-lzo.
This example M/R Job output produces parquet data from the input json data.
Edit the namenode and jobtracker configuration in jcascalog.parquet.WriteParquetFromJson.main():
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://hadoop01:9000");
conf.set("mapred.job.tracker", "hadoop01:9001");
To run M/R Job:
mvn -e -DskipTests=true clean install assembly:single ;
hadoop fs -rmr parquet/json;
hadoop fs -rmr parquet/out;
hadoop fs -rmr /tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;
hadoop fs -mkdir /tmp/lib;
hadoop fs -mkdir parquet/json;
hadoop fs -put target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar /tmp/lib;
hadoop fs -put src/test/resources/electricPowerUsageGenerator.json parquet/json/
hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
jcascalog.parquet.WriteParquetFromJson \
parquet/json/electricPowerUsageGenerator.json \
parquet/out \
/tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;
With this raw M/R Job, the specified column will be read from the parquet data.
Edit the namenode and jobtracker configuration in jcascalog.parquet.ReadSpecifiedColumns.main():
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://hadoop01:9000");
conf.set("mapred.job.tracker", "hadoop01:9001");
To run this M/R Job:
hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
jcascalog.parquet.ReadSpecifiedColumns \
parquet/out/ \
parquet/specified-columns \
/tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;
With JCascalog, the specified column will be read from the parquet data.
Edit the namenode and jobtracker configuration in jcascalog.parquet.ReadSpecifedColumnsWithParquetScheme.main():
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://hadoop01:9000");
conf.set("mapred.job.tracker", "hadoop01:9001");
To run this M/R Job:
hadoop fs -rmr parquet/specified-columns ;
hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
jcascalog.parquet.ReadSpecifedColumnsWithParquetScheme \
parquet/out/ \
parquet/specified-columns \
/tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;
To verify parquet data, read the parquet data, and produce it as text data.
Edit the namenode and jobtracker configuration in jcascalog.parquet.WriteTextFromParquet.main():
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://hadoop01:9000");
conf.set("mapred.job.tracker", "hadoop01:9001");
To run this M/R Job:
hadoop jar target/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar \
jcascalog.parquet.WriteTextFromParquet \
parquet/specified-columns \
parquet/text-out \
/tmp/lib/jcascalog-parquet-0.1.0-SNAPSHOT-hadoop-job.jar;