Hadoop MapReduce tool to convert Avro data files to Parquet format.

tispratik/avro2parquet

avro2parquet

Hadoop MapReduce program to convert Avro data files to Parquet format.

Installation

git clone https://github.com/laserson/avro2parquet.git
cd avro2parquet
mvn clean package

This will generate the jar files in the target/ directory.

Usage

This tool works on Avro container files, which are the standard Avro data file format (the Avro specification calls them object container files). The job reads each record as a key-value pair: the Avro GenericRecord is the key and a NullWritable is the value.
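
Because the tool expects Avro container files, it can be useful to confirm that an input really is one before submitting a job. The sketch below is not part of avro2parquet; the class and method names are hypothetical. It relies only on the fact that every Avro object container file starts with the four bytes `O`, `b`, `j`, `0x01`, and it assumes you are checking a local copy of the file rather than reading from HDFS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of avro2parquet.
final class AvroMagicCheck {
    // First four bytes of every Avro object container file: 'O', 'b', 'j', 0x01.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given bytes begin with the Avro container magic.
    static boolean hasAvroMagic(byte[] header) {
        return header != null
                && header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    // Reads the first four bytes of a local file and checks them (requires Java 11+).
    static boolean isAvroContainerFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasAvroMagic(in.readNBytes(MAGIC.length));
        }
    }

    // Prints one line per file path given on the command line.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + isAvroContainerFile(Paths.get(arg)));
        }
    }
}
```

A matching header only shows that the file claims to be an Avro container; it does not validate the embedded schema or the data blocks.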

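The tool is currently hardcoded to output Gzip-compressed Parquet. It is a plain MapReduce job that implements Hadoop's Tool interface, so it accepts the generic Hadoop options (for example -D property=value) ahead of its own arguments.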

The command takes the following form:

hadoop jar <avro2parquet jar file> \
com.cloudera.science.avro2parquet.Avro2Parquet \
<any generic Hadoop options> \
hdfs:///path/to/avro/schema.avsc \
hdfs:///path/to/avro/data \
hdfs:///output/path

For example:

hadoop jar avro2parquet-0.1.0-jar-with-dependencies.jar \
com.cloudera.science.avro2parquet.Avro2Parquet \
-D mapred.child.java.opts=-Xmx1024M \
hdfs:///user/lasersou/schemas/data.avsc \
hdfs:///user/lasersou/data \
hdfs:///user/lasersou/output
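
The argument order matters: generic options come first, followed by the schema path, the input path, and the output path. The sketch below illustrates that split in plain Java. It is a simplified stand-in, not Hadoop's actual option parser and not the code in this repository; the class name and accessors are hypothetical, and it only recognizes the `-D property=value` form used in the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the command-line layout, not the tool's real parser.
final class Avro2ParquetArgsSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> positional = new ArrayList<>();

    // Separates "-D key=value" pairs from the three positional paths.
    static Avro2ParquetArgsSketch parse(String[] args) {
        Avro2ParquetArgsSketch parsed = new Avro2ParquetArgsSketch();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Split on the first '=' only, so values may themselves contain '='.
                String[] kv = args[++i].split("=", 2);
                parsed.properties.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                parsed.positional.add(args[i]);
            }
        }
        if (parsed.positional.size() != 3) {
            throw new IllegalArgumentException(
                    "expected <schema> <input> <output>, got " + parsed.positional);
        }
        return parsed;
    }

    String schemaPath() { return positional.get(0); }
    String inputPath()  { return positional.get(1); }
    String outputPath() { return positional.get(2); }
}
```

With the example invocation, `properties` would hold `mapred.child.java.opts` mapped to `-Xmx1024M`, and the three accessors would return the schema, data, and output URIs in that order.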
