This note documents three simple methods to help you get started with data pipelines for TensorFlow or PyTorch.
For simplicity, we assume the data source is Apache Parquet files, but the methods can easily be extended to other data formats.
Choose the method that best fits your problem:
- Load data into memory and feed it to TensorFlow or PyTorch
- Use the Petastorm library to load Parquet and feed it to TensorFlow or PyTorch
- Convert data to the TFRecord data format and process it natively using TensorFlow
This works by reading the data into memory using Pandas or a similar package, converting it into NumPy arrays,
and then passing those to TensorFlow; a minimal sketch follows the examples list below.
Examples:
- Read with Pandas and feed to TensorFlow
- Read with Pandas and feed to PyTorch
- Read using PySpark and feed data from memory to TensorFlow
- Read with PyArrow and feed to TensorFlow
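As a concrete illustration of this method, here is a minimal sketch of reading a Parquet file with Pandas and feeding it to TensorFlow via tf.data. The file path and column layout (one column per feature plus a `label` column) are assumptions for illustration, not taken from the examples above.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Read the Parquet file into a Pandas DataFrame
# ("training_data.parquet" is a placeholder path)
df = pd.read_parquet("training_data.parquet")

# Convert to NumPy arrays; we assume one column per feature
# plus a "label" column (illustrative schema)
features = df.drop(columns=["label"]).to_numpy(dtype=np.float32)
labels = df["label"].to_numpy(dtype=np.float32)

# Wrap the in-memory arrays in a tf.data.Dataset, shuffle and batch
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=len(labels)).batch(128)

# model.fit(dataset, ...) can consume this dataset directly
```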
Petastorm is a library that enables single-machine or distributed training and
evaluation of deep learning models directly from datasets in Apache Parquet format; see the sketch after the examples below.
Examples:
- Petastorm + TensorFlow for the HLF classifier
- Petastorm + PyTorch for the HLF classifier
- Large Dataset: Petastorm + TensorFlow for the Inclusive classifier
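For orientation, here is a minimal sketch of feeding plain Parquet files to TensorFlow with Petastorm. It uses Petastorm's `make_batch_reader` (intended for Apache Parquet stores) and `make_petastorm_dataset`; the dataset URL and the column names `features` and `label` are placeholder assumptions.

```python
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# The URL is a placeholder; hdfs:// and s3:// URLs also work
with make_batch_reader("file:///path/to/parquet_dataset") as reader:
    # make_petastorm_dataset wraps the reader in a tf.data.Dataset;
    # with make_batch_reader, each element is a row-group-sized batch
    dataset = (make_petastorm_dataset(reader)
               .unbatch()  # back to individual rows
               .map(lambda row: (row.features, row.label))
               .batch(128))
    # training must happen inside the reader's context manager:
    # model.fit(dataset, ...)
```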
TFRecord is a native data format for TensorFlow. Once data is converted to the TFRecord format, it
can be processed natively in TensorFlow using tf.data and tf.io, as sketched after the examples below.
Examples:
- Particle classifier - High Level Features, data in TFRecord format
- Large Dataset: Particle classifier - Inclusive with LSTM, data in TFRecord format
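To show the native TFRecord path, here is a minimal sketch using tf.data and tf.io. The file pattern and the feature specification (names, shapes, and dtypes) are illustrative assumptions; they must match how the records were written.

```python
import tensorflow as tf

# Parsing spec: adjust names, shapes, and dtypes to your records
feature_spec = {
    "features": tf.io.FixedLenFeature([14], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return parsed["features"], parsed["label"]

# Read, parse, batch, and prefetch the TFRecord files
files = tf.data.Dataset.list_files("data/*.tfrecord")
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, ...) can consume this dataset directly
```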