This repository provides the source code and documentation for the Parquet Cube ingestion described in the GMD publication "A Parquet Cube alternative to store gridded data for data analytics and modeling".

necsi/parquetCubeIngestion

 
 

parquetCubeIngestion

Overview

The goal of Parquet Cube is to transform NetCDF data files into the Apache Parquet format, and then store these Parquet files in a Hadoop Distributed File System (HDFS) to make them available for further processing in a big data ecosystem.

This project contains the source code for the NetCDF to Parquet transformation, and lets you launch the transformation manually in a local environment.

It also contains the resources to deploy the transformation and ingestion of NetCDF files to Parquet in HDFS storage as a Kubernetes deployment. Included is the source code for building the corresponding Docker images and Helm charts.

Building the Parquet Cube project

After cloning the project to a local directory, go into the parent directory and compile the project:

cd parquet-cube-parent
mvn clean install
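
Before running the build, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch, not part of the repository: the helper name missing_tools is hypothetical, and the assumption is that the build needs only Maven (mvn) and a JDK (java).

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not something shipped with the project.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumption: the Maven build needs `mvn` and a JDK (`java`).
missing=$(missing_tools mvn java)
if [ -n "$missing" ]; then
  echo "Install before building:" $missing >&2
else
  echo "mvn and java found; ready to run: mvn clean install"
fi
```

If the project requires other tools in your environment, add their names to the missing_tools call.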

Building the docker images

The project contains three Docker images:

  • base-bigdata-java: the base image to interact with the big data platform
  • hadoop-tools: the base image to interact with HDFS
  • parquet-cube-crawler: the image containing the transformation and ingestion code

To build these images and push them to your Docker registry:

cd parquet-cube-parent
mvn docker:build
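
After the build, you may want to check that all three images exist locally. The sketch below is only an illustration: the helper missing_images is hypothetical, and because the registry prefix and tag produced by the build are not documented here, it matches on the image name alone.

```shell
# Read "repository:tag" lines on stdin and print each of the three project
# images that does not appear in the listing (hypothetical helper).
# Matching is on the image name only, whatever the registry prefix or tag.
missing_images() {
  listing=$(cat)
  for name in base-bigdata-java hadoop-tools parquet-cube-crawler; do
    if ! printf '%s\n' "$listing" | grep -Eq "(^|/)${name}:"; then
      printf '%s\n' "$name"
    fi
  done
}

# Feed it the local image list when Docker is available.
if command -v docker >/dev/null 2>&1; then
  docker images --format '{{.Repository}}:{{.Tag}}' | missing_images
fi
```

An empty result means all three image names were found in the local listing.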

Convert NetCDF files to Parquet locally

This repository comes with a script to run the NetCDF to Parquet transformation locally. This script can be found in the dev-local folder.

This folder contains configuration files and input/output folders that are used to replicate an environment on your local computer.

Please refer to the dedicated documentation to set up and use your local environment.

Building the helm chart

cd misc/helm
mvn deploy

Deploy using the helm chart

Edit the values.yaml file in the deployment/dev folder to match your configuration, then launch:

cd deployment/dev
../cicd/deploy.sh upstall

Set up environment

Replace all 'todefine' values with your local configuration.
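
To find the placeholders that still need a value, a recursive search is usually enough. This is a minimal sketch under stated assumptions: the helper list_todefine is hypothetical, and the deployment/dev path is only a guess at where the placeholders live, so point it at whichever folder holds your configuration.

```shell
# List every line that still contains the 'todefine' placeholder under a
# directory, as file:line:text. Prints nothing when no placeholder remains.
# list_todefine is a hypothetical helper, not part of the repository.
list_todefine() {
  grep -rn 'todefine' "$1" 2>/dev/null
  return 0
}

# Assumption: the placeholders live under deployment/dev; adjust as needed.
if [ -d deployment/dev ]; then
  remaining=$(list_todefine deployment/dev)
  if [ -n "$remaining" ]; then
    printf 'Still to configure:\n%s\n' "$remaining"
  else
    echo "No 'todefine' placeholders left."
  fi
fi
```

Running the check again after editing confirms that nothing was missed before you deploy.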
