Dataset

Users can use Neural Compressor built-in dataset objects as well as register their own datasets.

Built-in dataset support list

Neural Compressor supports built-in dataloaders on popular industry datasets. Refer to this HelloWorld example to learn how to configure a built-in dataloader.
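In the tables below, the keys listed under each dataset name in the yaml file carry the same names as the keyword arguments passed in user code. The following is a minimal sketch of that correspondence. It uses a stand-in registry rather than the real `DATASETS` object, and `FakeMNIST`, `registry`, and `build_dataset` are hypothetical names introduced only for illustration.

```python
# Hypothetical stand-in for a built-in dataset class.
# This is NOT the Neural Compressor implementation.
class FakeMNIST:
    def __init__(self, root, train=False, transform=None, filter=None, download=True):
        self.root = root
        self.train = train
        self.transform = transform
        self.filter = filter
        self.download = download


# Hypothetical registry playing the role of `DATASETS(framework)`.
registry = {"MNIST": FakeMNIST}


def build_dataset(dataset_cfg, transform=None, filter=None):
    """Build a dataset from the parsed `dataset:` section of a yaml file.

    `dataset_cfg` maps exactly one dataset name to its parameters.
    transform and filter are passed separately because they are not
    part of the `dataset:` section.
    """
    if len(dataset_cfg) != 1:
        raise ValueError("expected exactly one dataset entry")
    name, params = next(iter(dataset_cfg.items()))
    return registry[name](transform=transform, filter=filter, **params)


# Parsed equivalent of:
# dataset:
#    MNIST:
#      root: /path/to/root
#      train: False
#      download: True
cfg = {"MNIST": {"root": "/path/to/root", "train": False, "download": True}}
dataset = build_dataset(cfg)
print(dataset.root, dataset.train, dataset.download)
```

The real objects are obtained with `DATASETS(framework)` as shown in each "In user code" entry; the sketch only shows how the yaml keys line up with the constructor arguments.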

TensorFlow

Dataset Parameters Comments Usage
MNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. In yaml file:
dataset:
   MNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
FashionMNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. In yaml file:
dataset:
   FashionMNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR10(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. In yaml file:
dataset:
   CIFAR10:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR100(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. In yaml file:
dataset:
   CIFAR100:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True)
ImageRecord(root, transform, filter) root (str): Root directory of dataset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
root/validation-000-of-100
root/validation-001-of-100
...
root/validation-099-of-100
The file name needs to follow this pattern: '* - * -of- *'
In yaml file:
dataset:
   ImageRecord:
     root: /path/to/root
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImageRecord'] (root=root, transform=transform, filter=None)
ImageFolder(root, transform, filter) root (str): Root directory of dataset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
root/class_1/xxx.png
root/class_1/xxy.png
root/class_1/xxz.png
...
root/class_n/123.png
root/class_n/nsdf3.png
root/class_n/asd932_.png
Please put images of different categories into different folders.
In yaml file:
dataset:
   ImageFolder:
     root: /path/to/root
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None)
ImagenetRaw(data_path, image_list, transform, filter) data_path (str): Root directory of dataset
image_list (str): data file that records image names and their labels
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
data_path/img1.jpg
data_path/img2.jpg
...
data_path/imgx.jpg
The dataset reads the name and label of each image from the image_list file. If image_list is set to None, it reads from data_path/val_map.txt automatically.
In yaml file:
dataset:
   ImagenetRaw:
     data_path: /path/to/image
     image_list: /path/to/label
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None)
COCORecord(root, num_cores, transform, filter) root (str): Root directory of dataset
num_cores (int, default=28):The number of input Datasets to interleave from in parallel
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
root is the full path to the tfrecord file, including the file name.
Please use Resize transform when batch_size > 1
In yaml file:
dataset:
   COCORecord:
     root: /path/to/tfrecord
     num_cores: 28
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['COCORecord'] (root, num_cores=28, transform=transform, filter=None)
COCORaw(root, img_dir, anno_dir, transform, filter) root (str): Root directory of dataset
img_dir (str, default='val2017'): image file directory
anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
/root/img_dir/1.jpg
/root/img_dir/2.jpg
...
/root/img_dir/n.jpg
/root/anno_dir
Please use Resize transform when batch_size > 1
In yaml file:
dataset:
   COCORaw:
     root: /path/to/root
     img_dir: /path/to/image
     anno_dir: /path/to/annotation
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None)
If anno_dir is not set, the dataset will use default label map
COCONpy(root, npy_dir, anno_dir) root (str): Root directory of dataset
npy_dir (str, default='val2017'): npy file directory
anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory
Please arrange data in this way:
/root/npy_dir/1.jpg.npy
/root/npy_dir/2.jpg.npy
...
/root/npy_dir/n.jpg.npy
/root/anno_dir
Please use Resize transform when batch_size > 1
In yaml file:
dataset:
   COCONpy:
     root: /path/to/root
     npy_dir: /path/to/npy
     anno_dir: /path/to/annotation
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['COCONpy'] (root, npy_dir, anno_dir)
If anno_dir is not set, the dataset will use default label map
dummy(shape, low, high, dtype, label, transform, filter) shape (list or tuple): shape of the whole dataset; the first dimension is the sample count. To create multiple tensors, pass a list of tuples; one tensor is created per tuple.
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is used for all tensors.
high (list or float, default=127.): value added to every tensor element. If a list, its length must match the length of the shape list.
dtype (list or str, default='float32'): data type of each tensor. If a list, its length must match the length of the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'.
label (bool, default=True): whether to return 0 as the label
transform (transform object, default=None): the dummy dataset does not need a transform; a transform that is not None is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs tensors of a specified shape. Values are calculated as low * stand_normal(0, 1) + high. In yaml file:
dataset:
   dummy:
     shape: [3, 224, 224, 3]
     low: 0.0
     high: 127.0
     dtype: float32
     label: True
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None)
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) input_shape (list or tuple): shape of a single input sample, used to create one or more input tensors; e.g., an image size should be given as (224, 224, 3). A tuple containing multiple lists represents multiple input tensors.
label_shape (list or tuple): shape of a single label, used to create one or more label tensors; e.g., a label size should be given as (1,). A tuple containing multiple lists represents multiple label tensors. In yaml usage, the default value is (1,).
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is used for all tensors.
high (list or float, default=127.): value added to every tensor element. If a list, its length must match the length of the shape list.
dtype (list or str, default='float32'): data type of each tensor. If a list, its length must match the length of the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'.
transform (transform object, default=None): the dummy dataset does not need a transform; a transform that is not None is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs tensors of a specified shape. Values are calculated as low * stand_normal(0, 1) + high. In yaml file:
dataset:
   dummy_v2:
     input_shape: [224, 224, 3]
     label_shape: [1]
     low: 0.0
     high: 127.0
     dtype: float32

In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None)
style_transfer(content_folder, style_folder, crop_ratio, resize_shape, image_format, transform, filter) content_folder (str): Root directory of content images
style_folder (str): Root directory of style images
crop_ratio (float, default=0.1): ratio cropped from each side
resize_shape (tuple, default=(256, 256)): target size of image
image_format (str, default='jpg'): target image format
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Dataset used for style transfer tasks. It is constructed from two image folders: a content image folder and a style image folder. In yaml file:
dataset:
   style_transfer:
     content_folder: /path/to/content_folder
     style_folder: /path/to/style_folder
     crop_ratio: 0.1
     resize_shape: [256, 256]
     image_format: 'jpg'
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['style_transfer'] (content_folder, style_folder, crop_ratio, resize_shape, image_format, transform=transform, filter=None)
TFRecordDataset(root, transform, filter) root (str): filename of dataset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
root is the full path to the tfrecord file, including the file name. In yaml file:
dataset:
   TFRecordDataset:
     root: /path/to/tfrecord
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['TFRecordDataset'] (root, transform=transform)
bert(root, label_file, task, model_type, transform, filter) root (str): path of dataset
label_file (str): path of label file
task (str, default='squad'): task type of model
model_type (str, default='bert'): model type, support 'bert'.
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset supports tfrecord data, please refer to Guide to create tfrecord file first. In yaml file:
dataset:
   bert:
     root: /path/to/root
     label_file: /path/to/label_file
     task: squad
     model_type: bert
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['bert'] (root, label_file, transform=transform)
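The ImageRecord entry above requires record files named like `validation-000-of-100`. Before pointing `root` at a directory, it can help to check which files actually follow that naming. The sketch below does so with the standard library only. The glob `*-*-of-*` is an assumed rendering of the documented pattern `'* - * -of- *'` with the spaces removed to match the example names, and `list_record_shards` is a hypothetical helper, not part of Neural Compressor.

```python
import fnmatch
import os
import tempfile

# Assumed glob form of the documented pattern '* - * -of- *',
# written without spaces to match names such as validation-000-of-100.
SHARD_PATTERN = "*-*-of-*"


def list_record_shards(root):
    """Return the sorted file names directly under `root` that match SHARD_PATTERN."""
    names = [
        name
        for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name))
        and fnmatch.fnmatch(name, SHARD_PATTERN)
    ]
    return sorted(names)


# Demonstration on a throwaway directory with two shards and one unrelated file.
with tempfile.TemporaryDirectory() as root:
    for name in ("validation-000-of-100", "validation-001-of-100", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    shards = list_record_shards(root)

print(shards)
```

Files that do not match, such as `notes.txt` here, are simply left out of the returned list.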

PyTorch

Dataset Parameters Comments Usage
MNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. In yaml file:
dataset:
   MNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
FashionMNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. In yaml file:
dataset:
   FashionMNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR10(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. In yaml file:
dataset:
   CIFAR10:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR100(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. In yaml file:
dataset:
   CIFAR100:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True)
ImageFolder(root, transform, filter) root (str): Root directory of dataset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
root/class_1/xxx.png
root/class_1/xxy.png
root/class_1/xxz.png
...
root/class_n/123.png
root/class_n/nsdf3.png
root/class_n/asd932_.png
Please put images of different categories into different folders.
In yaml file:
dataset:
   ImageFolder:
     root: /path/to/root
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None)
ImagenetRaw(data_path, image_list, transform, filter) data_path (str): Root directory of dataset
image_list (str): data file that records image names and their labels
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
data_path/img1.jpg
data_path/img2.jpg
...
data_path/imgx.jpg
The dataset reads the name and label of each image from the image_list file. If image_list is set to None, it reads from data_path/val_map.txt automatically.
In yaml file:
dataset:
   ImagenetRaw:
     data_path: /path/to/image
     image_list: /path/to/label
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None)
COCORaw(root, img_dir, anno_dir, transform, filter) root (str): Root directory of dataset
img_dir (str, default='val2017'): image file directory
anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
/root/img_dir/1.jpg
/root/img_dir/2.jpg
...
/root/img_dir/n.jpg
/root/anno_dir
Please use Resize transform when batch_size>1
In yaml file:
dataset:
   COCORaw:
     root: /path/to/root
     img_dir: /path/to/image
     anno_dir: /path/to/annotation
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None)
If anno_dir is not set, the dataset will use default label map
dummy(shape, low, high, dtype, label, transform, filter) shape (list or tuple): shape of the whole dataset; the first dimension is the sample count. To create multiple tensors, pass a list of tuples; one tensor is created per tuple.
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is used for all tensors.
high (list or float, default=127.): value added to every tensor element. If a list, its length must match the length of the shape list.
dtype (list or str, default='float32'): data type of each tensor. If a list, its length must match the length of the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'.
label (bool, default=True): whether to return 0 as the label
transform (transform object, default=None): the dummy dataset does not need a transform; a transform that is not None is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs tensors of a specified shape. Values are calculated as low * stand_normal(0, 1) + high. In yaml file:
dataset:
   dummy:
     shape: [3, 224, 224, 3]
     low: 0.0
     high: 127.0
     dtype: float32
     label: True
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None)
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) input_shape (list or tuple): shape of a single input sample, used to create one or more input tensors; e.g., an image size should be given as (224, 224, 3). A tuple containing multiple lists represents multiple input tensors.
label_shape (list or tuple): shape of a single label, used to create one or more label tensors; e.g., a label size should be given as (1,). A tuple containing multiple lists represents multiple label tensors. In yaml usage, the default value is (1,).
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is used for all tensors.
high (list or float, default=127.): value added to every tensor element. If a list, its length must match the length of the shape list.
dtype (list or str, default='float32'): data type of each tensor. If a list, its length must match the length of the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'.
transform (transform object, default=None): the dummy dataset does not need a transform; a transform that is not None is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs tensors of a specified shape. Values are calculated as low * stand_normal(0, 1) + high. In yaml file:
dataset:
   dummy_v2:
     input_shape: [224, 224, 3]
     label_shape: [1]
     low: 0.0
     high: 127.0
     dtype: float32

In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None)
bert(dataset, task, model_type, transform, filter) dataset (list): list of data
task (str): the task of the model, support "classifier", "squad"
model_type (str, default='bert'): model type, support 'distilbert', 'bert', 'xlnet', 'xlm'
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset is constructed from the BERT TensorDataset and cannot be fully configured from a yaml file. The original repo is https://github.com/huggingface/transformers. To use this dataset, create it before you initialize your DataLoader. In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['bert'] (dataset, task, model_type, transform=transform, filter=None)
Configuration through a yaml file is not currently supported.
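The ImageFolder layout described above (one sub-folder per category under `root`) can be inspected before the directory is handed to the dataset. The sketch below walks such a tree with the standard library and pairs each file with an integer label. Assigning labels by sorted folder name is an assumption made for illustration and is not necessarily what the built-in dataset does; `scan_image_folder` is a hypothetical helper.

```python
import os
import tempfile


def scan_image_folder(root):
    """Return (relative_path, label) pairs for a root/class_x/file layout.

    Labels are assigned by sorted class-folder name (an assumption
    made for this sketch only).
    """
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, cls in enumerate(classes):
        for fname in sorted(os.listdir(os.path.join(root, cls))):
            samples.append((f"{cls}/{fname}", label))
    return samples


# Demonstration on a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as root:
    for rel in ("class_1/xxx.png", "class_1/xxy.png", "class_n/123.png"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    samples = scan_image_folder(root)

print(samples)
```

An empty result or a class folder with no files is a quick sign that the images were not sorted into per-category folders.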

MXNet

Dataset Parameters Comments Usage
MNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. In yaml file:
dataset:
   MNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
FashionMNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. In yaml file:
dataset:
   FashionMNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR10(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. In yaml file:
dataset:
   CIFAR10:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR100(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. In yaml file:
dataset:
   CIFAR100:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set within the dataset section)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True)
ImageFolder(root, transform, filter) root (str): Root directory of dataset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
root/class_1/xxx.png
root/class_1/xxy.png
root/class_1/xxz.png
...
root/class_n/123.png
root/class_n/nsdf3.png
root/class_n/asd932_.png
Please put images of different categories into different folders.
In yaml file:
dataset:
   ImageFolder:
     root: /path/to/root
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None)
ImagenetRaw(data_path, image_list, transform, filter) data_path (str): Root directory of dataset
image_list (str): data file that records image names and their labels
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
data_path/img1.jpg
data_path/img2.jpg
...
data_path/imgx.jpg
The dataset reads the name and label of each image from the image_list file. If image_list is set to None, it reads from data_path/val_map.txt automatically.
In yaml file:
dataset:
   ImagenetRaw:
     data_path: /path/to/image
     image_list: /path/to/label
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None)
COCORaw(root, img_dir, anno_dir, transform, filter) root (str): Root directory of dataset
img_dir (str, default='val2017'): image file directory
anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
/root/img_dir/1.jpg
/root/img_dir/2.jpg
...
/root/img_dir/n.jpg
/root/anno_dir
Please use Resize transform when batch_size > 1
In yaml file:
dataset:
   COCORaw:
     root: /path/to/root
     img_dir: /path/to/image
     anno_dir: /path/to/annotation
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None)
If anno_dir is not set, the dataset uses the default label map.
dummy(shape, low, high, dtype, label, transform, filter) shape (list or tuple): shape of the total samples; the first dimension should be the sample count of the dataset. Multiple tensors are supported: pass a list of tuples, and one tensor of the given size is created for each tuple in the list.
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value.
high (list or float, default=127.): offsets the tensor values by adding high to every element. If a list, its length should match the length of the shape list.
dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the length of the shape list; if a str, all tensors use the same dtype. Supported dtypes: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'
label (bool, default=True): whether to return 0 as the label
transform (transform object, default=None): the dummy dataset does not need a transform. If transform is not None, it is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs data of a specified shape; values are calculated as low * stand_normal(0, 1) + high.
In yaml file:
dataset:
   dummy:
     shape: [3, 224, 224, 3]
     low: 0.0
     high: 127.0
     dtype: float32
     label: True
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None)
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) input_shape (list or tuple): shape of a single sample, used to create one or more input tensors; e.g. an image should be represented as (224, 224, 3). A tuple containing multiple lists creates multiple input tensors.
label_shape (list or tuple): shape of a single label, used to create one or more label tensors; e.g. a label should be represented as (1,). A tuple containing multiple lists creates multiple label tensors. In yaml usage, the default value is (1,).
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value.
high (list or float, default=127.): offsets the tensor values by adding high to every element. If a list, its length should match the length of the shape list.
dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the length of the shape list; if a str, all tensors use the same dtype. Supported dtypes: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'
transform (transform object, default=None): the dummy dataset does not need a transform. If transform is not None, it is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs data of a specified shape; values are calculated as low * stand_normal(0, 1) + high.
In yaml file:
dataset:
   dummy_v2:
     input_shape: [224, 224, 3]
     label_shape: [1]
     low: 0.0
     high: 127.0
     dtype: float32

In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None)
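
The two dummy datasets read their shape arguments differently: `dummy` takes the shape of the whole sample set, with the sample count as the first dimension, while `dummy_v2` takes the shape of a single sample. The sketch below illustrates the `dummy` convention only. `split_dummy_shape` is a hypothetical helper written for this page, not a Neural Compressor API, and the requirement that all tensors share one sample count is an assumption.

```python
def split_dummy_shape(shape):
    """Split a `dummy`-style shape into (sample_count, per_sample_shapes).

    Hypothetical helper for illustration only; not a Neural Compressor API.
    `shape` is either one shape such as [3, 224, 224, 3] or a list of
    tuples, one tuple per tensor, as described for the `dummy` dataset.
    """
    # Normalize a single shape to a one-element list of shapes.
    if all(isinstance(dim, int) for dim in shape):
        shapes = [tuple(shape)]
    else:
        shapes = [tuple(s) for s in shape]

    # Assumption: every tensor uses the same first dimension (sample count).
    counts = {s[0] for s in shapes}
    if len(counts) != 1:
        raise ValueError("all tensors must share the same sample count")

    sample_count = counts.pop()
    per_sample = [s[1:] for s in shapes]
    return sample_count, per_sample


count, per_sample = split_dummy_shape([3, 224, 224, 3])
print(count, per_sample)  # 3 [(224, 224, 3)]
```

Under this reading, the yaml value `shape: [3, 224, 224, 3]` for `dummy` describes 3 samples of size (224, 224, 3), which corresponds to `input_shape: [224, 224, 3]` for `dummy_v2`.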

ONNXRT

Dataset Parameters Comments Usage
MNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, the dataset is downloaded to root/MNIST/; otherwise, put mnist.npz under root/MNIST/ manually.
In yaml file:
dataset:
   MNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set in the range of dataset)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
FashionMNIST(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, the dataset is downloaded to root/FashionMNIST/; otherwise, put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually.
In yaml file:
dataset:
   FashionMNIST:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set in the range of dataset)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR10(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, the dataset is downloaded to root/ and extracted automatically; otherwise, download https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to root/ manually and extract it.
In yaml file:
dataset:
   CIFAR10:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set in the range of dataset)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True)
CIFAR100(root, train, transform, filter, download) root (str): Root directory of dataset
train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
If download is True, the dataset is downloaded to root/ and extracted automatically; otherwise, download https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to root/ manually and extract it.
In yaml file:
dataset:
   CIFAR100:
     root: /path/to/root
     train: False
     download: True
(transform and filter are not set in the range of dataset)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True)
ImageFolder(root, transform, filter) root (str): Root directory of dataset
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
root/class_1/xxx.png
root/class_1/xxy.png
root/class_1/xxz.png
...
root/class_n/123.png
root/class_n/nsdf3.png
root/class_n/asd932_.png
Please put images of different categories into different folders.
In yaml file:
dataset:
   ImageFolder:
     root: /path/to/root
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None)
ImagenetRaw(data_path, image_list, transform, filter) data_path (str): Root directory of dataset
image_list (str): data file that records image names and their labels
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
data_path/img1.jpg
data_path/img2.jpg
...
data_path/imgx.jpg
The dataset reads the name and label of each image from the image_list file. If image_list is set to None, it reads from data_path/val_map.txt automatically.
In yaml file:
dataset:
   ImagenetRaw:
     data_path: /path/to/image
     image_list: /path/to/label
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None)
COCORaw(root, img_dir, anno_dir, transform, filter) root (str): Root directory of dataset
img_dir (str, default='val2017'): image file directory
anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Please arrange data in this way:
/root/img_dir/1.jpg
/root/img_dir/2.jpg
...
/root/img_dir/n.jpg
/root/anno_dir
Please use Resize transform when batch_size > 1
In yaml file:
dataset:
   COCORaw:
     root: /path/to/root
     img_dir: /path/to/image
     anno_dir: /path/to/annotation
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None)
If anno_dir is not set, the dataset uses the default label map.
dummy(shape, low, high, dtype, label, transform, filter) shape (list or tuple): shape of the total samples; the first dimension should be the sample count of the dataset. Multiple tensors are supported: pass a list of tuples, and one tensor of the given size is created for each tuple in the list.
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value.
high (list or float, default=127.): offsets the tensor values by adding high to every element. If a list, its length should match the length of the shape list.
dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the length of the shape list; if a str, all tensors use the same dtype. Supported dtypes: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'
label (bool, default=True): whether to return 0 as the label
transform (transform object, default=None): the dummy dataset does not need a transform. If transform is not None, it is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs data of a specified shape; values are calculated as low * stand_normal(0, 1) + high.
In yaml file:
dataset:
   dummy:
     shape: [3, 224, 224, 3]
     low: 0.0
     high: 127.0
     dtype: float32
     label: True
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None)
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) input_shape (list or tuple): shape of a single sample, used to create one or more input tensors; e.g. an image should be represented as (224, 224, 3). A tuple containing multiple lists creates multiple input tensors.
label_shape (list or tuple): shape of a single label, used to create one or more label tensors; e.g. a label should be represented as (1,). A tuple containing multiple lists creates multiple label tensors. In yaml usage, the default value is (1,).
low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value.
high (list or float, default=127.): offsets the tensor values by adding high to every element. If a list, its length should match the length of the shape list.
dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the length of the shape list; if a str, all tensors use the same dtype. Supported dtypes: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'
transform (transform object, default=None): the dummy dataset does not need a transform. If transform is not None, it is ignored.
filter (Filter objects, default=None): filter out examples according to specific conditions
This dataset constructs data of a specified shape; values are calculated as low * stand_normal(0, 1) + high.
In yaml file:
dataset:
   dummy_v2:
     input_shape: [224, 224, 3]
     label_shape: [1]
     low: 0.0
     high: 127.0
     dtype: float32

In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None)
GLUE(data_dir, model_name_or_path, max_seq_length, do_lower_case, task, model_type, dynamic_length, evaluate, transform, filter) data_dir (str): The input data directory
model_name_or_path (str): Path to pre-trained student model or shortcut name
max_seq_length (int, default=128): The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
do_lower_case (bool, default=True): Whether or not to lowercase the input.
task (str, default='mrpc'): The name of the task to fine-tune. Choices include mrpc, qqp, qnli, rte, sts-b, cola, mnli, wnli.
model_type (str, default='bert'): model type, support 'distilbert', 'bert', 'mobilebert', 'roberta'.
dynamic_length (bool, default=False): Whether to use fixed sequence length.
evaluate (bool, default=True): Whether to do evaluation or training.
transform (transform object, default=None): transform to process input data
filter (Filter objects, default=None): filter out examples according to specific conditions
Refer to this example on how to prepare the dataset.
In yaml file:
dataset:
   bert:
     data_dir: /path/to/data
     model_name_or_path: bert-base-uncased
(transform and filter are not set in the range of dataset)
In user code:
from neural_compressor.experimental.data import DATASETS
datasets = DATASETS(framework)
dataset = datasets['bert'] (data_dir='/path/to/data/', model_name_or_path='bert-base-uncased', max_seq_length=128, task='mrpc', model_type='bert', dynamic_length=True, transform=None, filter=None)
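
The `max_seq_length` behaviour described above (longer sequences are truncated, shorter ones are padded) can be pictured with a small standalone sketch. It is illustrative only and is not the tokenizer or dataset code used by Neural Compressor; `pad_or_truncate` and the padding id `0` are assumptions, and real tokenizers also handle special tokens and sentence pairs.

```python
def pad_or_truncate(token_ids, max_seq_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_seq_length long.

    Hypothetical helper for illustration; pad_id=0 is an assumed padding id.
    """
    ids = list(token_ids)[:max_seq_length]   # truncate sequences that are too long
    mask = [1] * len(ids)                    # 1 marks real tokens
    padding = max_seq_length - len(ids)
    ids += [pad_id] * padding                # pad sequences that are too short
    mask += [0] * padding                    # 0 marks padding positions
    return ids, mask


ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_seq_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

With fixed-length inputs like these, every sample in a batch has the same shape, which is what allows samples to be stacked into one tensor.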

User-specific dataset

Users can register their own datasets as follows:

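class Dataset(object):
    def __init__(self, args):
        # init code here, e.g. load the samples described by args
        self.args = args
        self.samples = []  # placeholder: fill with (data, label) pairs

    def __getitem__(self, idx):
        # use idx to get data and label
        data, label = self.samples[idx]
        return data, label

    def __len__(self):
        # number of samples in the dataset
        return len(self.samples)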

After defining the dataset class, pass it to the quantizer:

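from neural_compressor import Quantization, common
dataset = Dataset(args)  # instance of the user-defined dataset class
quantizer = Quantization(yaml_file)  # yaml_file: path to the user's yaml configuration
quantizer.calib_dataloader = common.DataLoader(dataset)  # optional args such as batch_size and collate_fn can also be passed
quantizer.model = common.Model(graph)  # graph: the model to quantize
quantizer.eval_func = eval_func  # eval_func: user-defined evaluation function
q_model = quantizer()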
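
As a concrete illustration of the dataset protocol above (`__getitem__` returning a data/label pair and `__len__` returning the sample count), here is a minimal in-memory sketch. `ListDataset` and `iter_batches` are hypothetical names introduced for this example; `iter_batches` is only a stand-in that shows how a dataloader might consume the dataset, not how `common.DataLoader` is implemented.

```python
class ListDataset(object):
    """In-memory dataset; hypothetical example, not a Neural Compressor class."""

    def __init__(self, samples, labels):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = list(samples)
        self.labels = list(labels)

    def __getitem__(self, idx):
        # Return one (data, label) pair, as the protocol requires.
        return self.samples[idx], self.labels[idx]

    def __len__(self):
        return len(self.samples)


def iter_batches(dataset, batch_size):
    """Yield (data_batch, label_batch) lists using only len() and indexing."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        data, labels = zip(*items)
        yield list(data), list(labels)


dataset = ListDataset([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0, 1, 0])
for data, labels in iter_batches(dataset, batch_size=2):
    print(len(data), labels)
```

Because the class relies only on indexing and length, an instance such as `dataset` here can be handed to `common.DataLoader` in the same way as the `Dataset` class shown earlier.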