This document provides a comprehensive guide on preparing custom datasets for training, supporting both image and video data.
Our training pipeline accommodates the following types of datasets:
- Image Datasets: Designed for training models on static visual content, including:
  - Single image conversations
  - Multi-image conversations
- Video Datasets: Designed for training models on dynamic visual content, including:
  - Single video conversations only
Image datasets are used for training models to understand and analyze visual content. This section explains how to structure and format your image dataset.
Organize your image dataset in your local filesystem as shown below:
/path/to/image_dataset/
│
├── train.jsonl
├── test.jsonl (optional)
└── image_folder/
    ├── image1.jpg
    ├── image2.png
    └── ...
- train.jsonl: Contains the training data samples (see the writing sketch after this list).
- test.jsonl: (Optional) Contains the test data samples for evaluation.
- image_folder/: Directory containing all the images referenced in the JSONL files.
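Since train.jsonl and test.jsonl follow the JSON Lines format, they can be written with the standard json module. A minimal sketch, where the samples list and output path are illustrative placeholders rather than part of the pipeline:

```python
import json

# Illustrative placeholder: each dict should follow the structure described below.
samples = [
    {"messages": [], "images": []},
]

# Write one JSON object per line (JSON Lines format).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```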
The train.jsonl and test.jsonl files should contain one JSON object per line, each representing a single data sample. The structure is as follows:
{
  "messages": [
    {
      "role": "user" or "assistant",
      "content": [
        {
          "type": "text" or "image",
          "text": "string (only if type is text)"
        }
      ]
    }
  ],
  "images": ["relative_path_to_image1", "relative_path_to_image2", ...]
}
Key components:
- messages: An array of message objects representing the conversation.
- role: Indicates whether the message is from the "user" or the "assistant".
- content: An array of content objects, which can be either text or image.
- images: An array of relative paths to the image files used in the conversation.
Here's an example of a data sample with a single image:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "text": null
        },
        {
          "type": "text",
          "text": "What's in this image?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The image shows a cat sitting on a windowsill."
        }
      ]
    }
  ],
  "images": ["image_folder/cat_on_windowsill.jpg"]
}
This example demonstrates a simple question-answer interaction about a single image.
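The same sample can also be built programmatically and appended to train.jsonl as a single line. This is only an illustrative sketch; the output path is a placeholder:

```python
import json

# Build a single-image sample matching the example above.
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "text": None},
                {"type": "text", "text": "What's in this image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "The image shows a cat sitting on a windowsill."},
            ],
        },
    ],
    "images": ["image_folder/cat_on_windowsill.jpg"],
}

# Append the sample to train.jsonl as one JSON Lines entry; None is written as null.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```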
For scenarios involving multiple images, structure your data sample like this:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "text": null
        },
        {
          "type": "image",
          "text": null
        },
        {
          "type": "text",
          "text": "What's in these images?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The first image shows a cat, and the second image shows a dog."
        }
      ]
    }
  ],
  "images": ["image_folder/cat.jpg", "image_folder/dog.jpg"]
}
Note: The length of the images array should match the number of content objects with type set to image. Additionally, the order of the images array should correspond to the order of the content objects with type set to image.
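Because a mismatch between image placeholders and image paths is easy to introduce, it can help to check this before training. The following sketch is illustrative and assumes a train.jsonl in the current directory:

```python
import json

# Check that each sample's "images" list lines up with its image-type content entries.
with open("train.jsonl", "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        sample = json.loads(line)
        image_slots = sum(
            1
            for message in sample["messages"]
            for item in message["content"]
            if item["type"] == "image"
        )
        if image_slots != len(sample["images"]):
            print(
                f"Line {line_number}: {image_slots} image placeholders, "
                f"{len(sample['images'])} image paths"
            )
```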
Video datasets are used for training models to understand and analyze moving visual content. This section explains how to structure and format your video dataset.
Organize your video dataset in your local filesystem as shown below:
/path/to/video_dataset/
│
├── train.jsonl
├── test.jsonl (optional)
└── video_folder/
    ├── video1.mp4
    ├── video2.avi
    └── ...
- train.jsonl: Contains the training data samples.
- test.jsonl: (Optional) Contains the test data samples for evaluation.
- video_folder/: Directory containing all the video files referenced in the JSONL files.
The train.jsonl and test.jsonl files should contain one JSON object per line, each representing a single data sample. The structure for video datasets is as follows:
{
  "messages": [
    {
      "role": "user" or "assistant",
      "content": [
        {
          "type": "text" or "video",
          "text": "string (only if type is text)"
        }
      ]
    }
  ],
  "video": {
    "path": "relative_path_to_video",
    "num_frames": integer
  }
}
Key components:
- messages: An array of message objects representing the conversation.
- role: Indicates whether the message is from the "user" or the "assistant".
- content: An array of content objects, which can be either text or video.
- video: An object containing information about the video file.
- path: The relative path to the video file.
- num_frames: The number of frames you want to extract from the video.
Here's an example of a data sample for a video dataset:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "video",
          "text": null
        },
        {
          "type": "text",
          "text": "What's happening in this video?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The video shows a cat playing with a toy mouse."
        }
      ]
    }
  ],
  "video": {
    "path": "video_folder/cat_playing.mp4",
    "num_frames": 150
  }
}
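This guide does not specify how the training pipeline selects the requested frames. As an illustration only, the sketch below shows one common interpretation, evenly spaced sampling; the function name and the total frame count are made up, and the pipeline's actual strategy may differ:

```python
# Hypothetical helper: evenly spaced frame indices for a requested num_frames.
# total_frames (450 below) is a made-up value for illustration only.
def frame_indices(total_frames: int, num_frames: int) -> list[int]:
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

print(frame_indices(total_frames=450, num_frames=150))  # 150 evenly spaced indices
```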
- Each line in the JSONL files represents a single data sample for both image and video datasets.
- Ensure that all file paths in the JSONL files are relative to the dataset root directory; a quick path check is sketched after this list.
- For image datasets, multiple images can be referenced in a single data sample.
- For video datasets, each data sample typically refers to a single video file.
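As a final check before training, you can verify that every referenced file exists relative to the dataset root. A minimal sketch, assuming the dataset root shown below (a placeholder) and handling both image and video samples:

```python
import json
from pathlib import Path

dataset_root = Path("/path/to/image_dataset")  # placeholder root directory

# Verify that every referenced image or video exists relative to the dataset root.
with open(dataset_root / "train.jsonl", "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        sample = json.loads(line)
        paths = list(sample.get("images", []))
        if "video" in sample:
            paths.append(sample["video"]["path"])
        for rel_path in paths:
            if not (dataset_root / rel_path).is_file():
                print(f"Line {line_number}: missing file {rel_path}")
```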
By following this guide, you can prepare your custom image and video datasets in a format that is compatible with the training pipeline, enabling effective model training on your specific data.