Split subset #1281

CourchesneA · 2024-02-28T13:38:17Z

I have a case where my dataset comes already split into "train" and "test", but I need to add a validation set.
It seems like the "split" transform cannot do this: it merges everything together as a first step.

Is there a way to achieve this? I would like either to specify a subset in the "split" transform, or to run the split on one subset and then reassign / overwrite an existing subset of my original dataset.

ex. before:

subsets
	train: # of items=227, # of annotated items=227, # of annotations=604, annotation types=['polygon', 'bbox']
	test: # of items=10, # of annotated items=10, # of annotations=20, annotation types=['polygon', 'bbox']

after:

subsets
	train: # of items=207, # of annotated items=10, # of annotations=557, annotation types=['polygon', 'bbox']
	val: # of items=22, # of annotated items=10, # of annotations=47, annotation types=['polygon', 'bbox']
	test: # of items=10, # of annotated items=10, # of annotations=20, annotation types=['polygon', 'bbox']

I need the test set to be untouched, i.e. it should contain the same items as before.


vinnamkim · 2024-02-29T08:52:20Z

Hi @CourchesneA,
Thanks for your interest in our project. Unfortunately, there is no single command that does this. However, it can be done by combining several commands. Here is some sample code.

Create a synthetic dataset (skip this step if you are using your own dataset)

import datumaro as dm
import numpy as np

# Create a synthetic dataset from code
src_dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id=f"{subset}_{idx}",
            subset=subset,
            media=dm.Image.from_numpy(np.zeros([3, 10, 10])),
            annotations=[dm.Label(label=idx % 2)]
        )
        for idx in range(20)
        for subset in ["train", "test"]
    ],
    categories=["cat", "dog"],
)
print(src_dataset)

Dataset
	size=40
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=40
	annotations_count=40
subsets
	test: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
	train: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']

Split each subset into its own standalone dm.Dataset

train_only_dataset = dm.Dataset(source=src_dataset.get_subset("train"))
test_only_dataset = dm.Dataset(source=src_dataset.get_subset("test"))
print(train_only_dataset)

Dataset
	size=20
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=20
	annotations_count=20
subsets
	train: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']

Apply random split transform to train_only_dataset

train_val_dataset = train_only_dataset.transform(
    "random_split",
    splits=[("train", 0.67), ("val", 0.33)],
)
print(train_val_dataset)

Dataset
	size=20
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=20
	annotations_count=20
subsets
	train: # of items=13, # of annotated items=13, # of annotations=13, annotation types=['label']
	val: # of items=7, # of annotated items=7, # of annotations=7, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']
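
For intuition about where the 13 / 7 counts come from: one plausible way a ratio-based random split can work is to shuffle the items and cut the shuffled list at cumulative ratio boundaries. The sketch below is plain Python written for illustration, not Datumaro's actual code; `ratio_split` and its `seed` argument are hypothetical names.

```python
import random


def ratio_split(ids, splits, seed=None):
    """Shuffle ids and cut them at cumulative ratio boundaries.

    Hypothetical helper for illustration only; not part of Datumaro.
    """
    ids = list(ids)
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    rng.shuffle(ids)

    result = {}
    lower = 0
    cumulative = 0.0
    for i, (name, ratio) in enumerate(splits):
        cumulative += ratio
        # The last split takes whatever remains, so rounding never drops an item.
        upper = len(ids) if i == len(splits) - 1 else int(cumulative * len(ids))
        result[name] = ids[lower:upper]
        lower = upper
    return result


parts = ratio_split(
    [f"train_{i}" for i in range(20)],
    [("train", 0.67), ("val", 0.33)],
    seed=0,
)
print({name: len(items) for name, items in parts.items()})
```

With 20 items and ratios 0.67 / 0.33, the first boundary falls at int(13.4) = 13, so this sketch yields 13 train items and 7 val items, the same counts as in the output above.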

Merge train_val_dataset and test_only_dataset into one dm.Dataset

dst_dataset = dm.HLOps.merge(train_val_dataset, test_only_dataset)
print(dst_dataset)

Dataset
	size=40
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=40
	annotations_count=40
subsets
	test: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
	train: # of items=13, # of annotated items=13, # of annotations=13, annotation types=['label']
	val: # of items=7, # of annotated items=7, # of annotations=7, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']
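
To confirm that the test subset really is untouched, one option is to compare the item ids per subset before and after. The helper below is only a sketch: `ids_by_subset` is a hypothetical name, and it assumes nothing more than items exposing `id` and `subset` attributes, as `dm.DatasetItem` does. Stand-in items are used so the snippet runs without Datumaro installed.

```python
from collections import defaultdict, namedtuple


def ids_by_subset(items):
    """Group item ids by subset name (hypothetical helper, not a Datumaro API)."""
    groups = defaultdict(set)
    for item in items:
        groups[item.subset].add(item.id)
    return dict(groups)


# Stand-in for dataset items; only `id` and `subset` are needed.
Item = namedtuple("Item", ["id", "subset"])

before = [Item(f"{s}_{i}", s) for s in ("train", "test") for i in range(20)]
# Simulate the result: the last 7 train items move to "val", test is left alone.
after = [
    Item(it.id, "val") if it.subset == "train" and int(it.id.split("_")[1]) >= 13 else it
    for it in before
]

b, a = ids_by_subset(before), ids_by_subset(after)
assert a["test"] == b["test"]               # test subset is unchanged
assert a["train"] | a["val"] == b["train"]  # old train was only partitioned
assert not (a["train"] & a["val"])          # no item landed in both subsets
```

With the real datasets, the same comparison should work as `ids_by_subset(src_dataset)` against `ids_by_subset(dst_dataset)`, assuming that iterating a `dm.Dataset` yields its items.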

CourchesneA · 2024-02-29T13:43:37Z

That's exactly what I was looking for, thanks for the detailed example!

github-actions bot assigned vinnamkim Feb 28, 2024

vinnamkim added the user experience Questions about our products or things to improve user experience label Feb 29, 2024

CourchesneA closed this as completed Feb 29, 2024
