Three methods to feed Parquet data into TensorFlow or PyTorch

This note documents three simple ways to build data pipelines for TensorFlow or PyTorch. For simplicity, we assume the data source is a set of Apache Parquet files, but the same approaches extend to other data formats.
Choose whichever method best fits your problem:

  1. Load data into memory and feed it to TensorFlow or PyTorch
  2. Use the Petastorm library to load Parquet and feed it to TensorFlow or PyTorch
  3. Convert data to the TFRecord data format and process it natively using TensorFlow

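1. Load data into memory, then feed it to TensorFlow or PyTorch

This works by reading the data into memory using Pandas or a similar package, converting it into NumPy arrays, and then passing those arrays to TensorFlow or PyTorch. It is the simplest option, but it only works when the dataset fits in memory.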
Examples:
- Read with Pandas and feed to TensorFlow
- Read with Pandas and feed to PyTorch
- Read using PySpark and feed data from memory to TensorFlow
- Read with PyArrow and feed to TensorFlow
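
As a rough illustration of the in-memory approach, the sketch below reads a Parquet file with Pandas, converts the selected columns to NumPy arrays, and wraps them for either framework. The column names, function names, and batch size are hypothetical placeholders, not taken from the linked examples; `pd.read_parquet` additionally needs PyArrow or fastparquet installed.

```python
import numpy as np
import pandas as pd


def frame_to_arrays(df, feature_columns, label_column):
    """Convert a DataFrame into (features, labels) NumPy arrays."""
    features = df[feature_columns].to_numpy(dtype=np.float32)
    labels = df[label_column].to_numpy(dtype=np.int64)
    return features, labels


def parquet_to_arrays(path, feature_columns, label_column):
    """Read only the needed columns of a Parquet file into memory."""
    df = pd.read_parquet(path, columns=feature_columns + [label_column])
    return frame_to_arrays(df, feature_columns, label_column)


def to_tf_dataset(features, labels, batch_size=32):
    # Imported lazily so the loader above works without TensorFlow installed.
    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.shuffle(len(features)).batch(batch_size)


def to_torch_loader(features, labels, batch_size=32):
    # Imported lazily so the loader above works without PyTorch installed.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

With a hypothetical file `train.parquet` containing columns `f1`, `f2`, and `label`, calling `parquet_to_arrays("train.parquet", ["f1", "f2"], "label")` yields arrays that can be passed to `to_tf_dataset` or `to_torch_loader`.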

2. Ingest Parquet files using Petastorm and feed them to TensorFlow or PyTorch

Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format.
Examples:
- Petastorm + TensorFlow for the HLF classifier
- Petastorm + PyTorch for the HLF classifier
- Large Dataset: Petastorm + TensorFlow for the Inclusive classifier
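
A minimal sketch of the Petastorm route with TensorFlow might look like the following. It assumes a plain Parquet dataset (read with `make_batch_reader`) and an already compiled Keras model; the helper names, the field names passed in, and the batch size are hypothetical, and depending on the column types you may need extra reshaping before training.

```python
from pathlib import Path


def to_file_url(path):
    """Petastorm expects a dataset URL (file://, hdfs://, ...), not a bare path."""
    return Path(path).absolute().as_uri()


def train_from_parquet(parquet_dir, model, feature_field, label_field, batch_size=128):
    # Imported lazily so the URL helper can be used without these packages.
    from petastorm import make_batch_reader
    from petastorm.tf_utils import make_petastorm_dataset

    # The reader streams Parquet row groups; it must stay open while the
    # dataset is being consumed, hence training happens inside the "with".
    with make_batch_reader(
        to_file_url(parquet_dir),
        schema_fields=[feature_field, label_field],
        num_epochs=1,
    ) as reader:
        dataset = (
            make_petastorm_dataset(reader)
            # Each element is a namedtuple holding one row group per field.
            .map(lambda row: (getattr(row, feature_field), getattr(row, label_field)))
            .unbatch()          # row groups have arbitrary sizes
            .batch(batch_size)  # re-batch to a fixed training batch size
        )
        model.fit(dataset, epochs=1)
```

Because the data is streamed from disk in row groups, this approach does not require the full dataset to fit in memory.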

3. Convert data into the TFRecord format and ingest the datasets natively with TensorFlow

TFRecord is a native data format for TensorFlow. Once data is converted to the TFRecord format, it can be processed natively in TensorFlow using tf.data and tf.io.
Examples:
- Particle classifier - High Level Features, data in TFRecord format
- Large Dataset: Particle classifier - Inclusive with LSTM, data in TFRecord format
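
To make the TFRecord workflow concrete, here is a small sketch that writes feature/label pairs as `tf.train.Example` records and reads them back with `tf.data` and `tf.io`. It is a single-machine illustration and not the Spark-based conversion referenced below; the record keys `"features"` and `"label"`, the function names, and the batch size are assumptions made for the example.

```python
def write_tfrecord(path, features, labels):
    """Serialize (feature vector, integer label) pairs into one TFRecord file."""
    import tensorflow as tf  # lazy import keeps the module loadable without TensorFlow

    with tf.io.TFRecordWriter(path) as writer:
        for x, y in zip(features, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                "features": tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(v) for v in x])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(y)])),
            }))
            writer.write(example.SerializeToString())


def read_tfrecord(paths, feature_dim, batch_size=128):
    """Build a batched tf.data pipeline over one or more TFRecord files."""
    import tensorflow as tf

    # The parsing spec must match what was written: a fixed-length float
    # vector and a scalar integer label.
    spec = {
        "features": tf.io.FixedLenFeature([feature_dim], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, spec)
        return parsed["features"], parsed["label"]

    return (
        tf.data.TFRecordDataset(paths)
        .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
```

The dataset returned by `read_tfrecord` yields `(features, label)` batches and can be passed directly to a Keras model's `fit` method.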

Examples of how to convert Parquet files into TFRecord using Apache Spark