This note documents three simple methods to help you get started with data pipelines for TensorFlow or PyTorch.
For simplicity, we assume the data source is Apache Parquet files, but the methods can easily be extended to other data formats.
Choose the method that best fits your problem:
- Load data into memory and feed it to TensorFlow or PyTorch
- Use the Petastorm library to load Parquet and feed it to TensorFlow or PyTorch
- Convert data to the TFRecord data format and process it natively using TensorFlow
This works by reading the data into memory using Pandas or a similar package, converting it into NumPy arrays,
and then passing those to TensorFlow; a minimal sketch follows the examples list below.
Examples:
- Read with Pandas and feed to TensorFlow
- Read with Pandas and feed to PyTorch
- Read using PySpark and feed data from memory to TensorFlow
- Read with PyArrow and feed to TensorFlow
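As a concrete illustration of this method, here is a minimal sketch of reading a Parquet file with Pandas and feeding it to TensorFlow via tf.data. The file path and column layout (one column per feature plus a `label` column) are assumptions for illustration, not taken from the examples above.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Read the Parquet file into a Pandas DataFrame
# ("training_data.parquet" is a placeholder path)
df = pd.read_parquet("training_data.parquet")

# Convert to NumPy arrays; we assume one column per feature
# plus a "label" column (illustrative schema)
features = df.drop(columns=["label"]).to_numpy(dtype=np.float32)
labels = df["label"].to_numpy(dtype=np.float32)

# Wrap the in-memory arrays in a tf.data.Dataset, shuffle and batch
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=len(labels)).batch(128)

# model.fit(dataset, ...) can consume this dataset directly
```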
Petastorm is a library that enables single-machine or distributed training and
evaluation of deep learning models directly from datasets in Apache Parquet format; see the sketch after the examples below.
Examples:
- Petastorm + TensorFlow for the HLF classifier
- Petastorm + PyTorch for the HLF classifier
- Large Dataset: Petastorm + TensorFlow for the Inclusive classifier
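For orientation, here is a minimal sketch of feeding plain Parquet files to TensorFlow with Petastorm. It uses Petastorm's `make_batch_reader` (intended for Apache Parquet stores) and `make_petastorm_dataset`; the dataset URL and the column names `features` and `label` are placeholder assumptions.

```python
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# The URL is a placeholder; hdfs:// and s3:// URLs also work
with make_batch_reader("file:///path/to/parquet_dataset") as reader:
    # make_petastorm_dataset wraps the reader in a tf.data.Dataset;
    # with make_batch_reader, each element is a row-group-sized batch
    dataset = (make_petastorm_dataset(reader)
               .unbatch()  # back to individual rows
               .map(lambda row: (row.features, row.label))
               .batch(128))
    # training must happen inside the reader's context manager:
    # model.fit(dataset, ...)
```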
TFRecord is a native data format for TensorFlow. Once data is converted to the TFRecord format, it
can be processed natively in TensorFlow using tf.data and tf.io, as sketched after the examples below.
Examples:
- Particle classifier - High Level Features, data in TFRecord format
- Large Dataset: Particle classifier - Inclusive with LSTM, data in TFRecord format
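To show the native TFRecord path, here is a minimal sketch using tf.data and tf.io. The file pattern and the feature specification (names, shapes, and dtypes) are illustrative assumptions; they must match how the records were written.

```python
import tensorflow as tf

# Parsing spec: adjust names, shapes, and dtypes to your records
feature_spec = {
    "features": tf.io.FixedLenFeature([14], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return parsed["features"], parsed["label"]

# Read, parse, batch, and prefetch the TFRecord files
files = tf.data.Dataset.list_files("data/*.tfrecord")
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, ...) can consume this dataset directly
```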