Split subset #1281

Closed
CourchesneA opened this issue Feb 28, 2024 · 2 comments

@CourchesneA

My dataset comes already split into "train" and "test", but I need to add a validation set.
It seems the "split" transform cannot do this: it merges everything together as a first step.

Is there a way to achieve this? I would like either to specify a subset in the "split" transform, or to run the split on one subset and then reassign or overwrite that subset in my original dataset.

ex. before:

subsets
	train: # of items=227, # of annotated items=227, # of annotations=604, annotation types=['polygon', 'bbox']
	test: # of items=10, # of annotated items=10, # of annotations=20, annotation types=['polygon', 'bbox']

after:

subsets
	train: # of items=207, # of annotated items=10, # of annotations=557, annotation types=['polygon', 'bbox']
	val: # of items=22, # of annotated items=10, # of annotations=47, annotation types=['polygon', 'bbox']
	test: # of items=10, # of annotated items=10, # of annotations=20, annotation types=['polygon', 'bbox']

I need the test set to stay untouched, i.e. it should contain the same items as before.
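
The requirement above can be written as a checkable invariant: `test` holds the same item ids before and after, and the new `train` and `val` together cover exactly the old `train` without overlapping. A minimal sketch in plain Python, assuming each dataset is summarized as a dict from subset name to a set of item ids. The helper `check_split_invariant` and this representation are hypothetical and not part of Datumaro.

```python
def check_split_invariant(before, after, keep="test", source="train", new="val"):
    """Return True if `keep` is unchanged and `source` was partitioned into `source` + `new`.

    `before` and `after` map subset names to sets of item ids (an assumed,
    simplified stand-in for a real dataset).
    """
    untouched = before[keep] == after[keep]
    disjoint = after[source].isdisjoint(after[new])
    covers = (after[source] | after[new]) == before[source]
    return untouched and disjoint and covers


before = {"train": {"a", "b", "c", "d"}, "test": {"x", "y"}}
after = {"train": {"a", "b", "c"}, "val": {"d"}, "test": {"x", "y"}}
print(check_split_invariant(before, after))
```

With a real dataset, the id sets would have to be collected from the items of each subset first; how to do that depends on the library in use.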

@vinnamkim
Contributor

Hi @CourchesneA,
Thanks for your interest in our project. Unfortunately, there is no single command that does what you need. However, it can be done by combining several commands, as in the sample code below.

  1. Create a synthetic dataset (skip this step and use your own dataset instead)
import datumaro as dm
import numpy as np

# Create a synthetic dataset from code
src_dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id=f"{subset}_{idx}",
            subset=subset,
            media=dm.Image.from_numpy(np.zeros([3, 10, 10])),
            annotations=[dm.Label(label=idx % 2)]
        )
        for idx in range(20)
        for subset in ["train", "test"]
    ],
    categories=["cat", "dog"],
)
print(src_dataset)
Dataset
	size=40
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=40
	annotations_count=40
subsets
	test: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
	train: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']
  2. Extract each subset into its own dm.Dataset
train_only_dataset = dm.Dataset(source=src_dataset.get_subset("train"))
test_only_dataset = dm.Dataset(source=src_dataset.get_subset("test"))
print(train_only_dataset)
Dataset
	size=20
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=20
	annotations_count=20
subsets
	train: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']
  3. Apply the random split transform to train_only_dataset
train_val_dataset = train_only_dataset.transform(
    "random_split",
    splits=[("train", 0.67), ("val", 0.33)],
)
print(train_val_dataset)
Dataset
	size=20
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=20
	annotations_count=20
subsets
	train: # of items=13, # of annotated items=13, # of annotations=13, annotation types=['label']
	val: # of items=7, # of annotated items=7, # of annotations=7, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']
  4. Merge train_val_dataset and test_only_dataset into one dm.Dataset
dst_dataset = dm.HLOps.merge(train_val_dataset, test_only_dataset)
print(dst_dataset)
Dataset
	size=40
	source_path=None
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=40
	annotations_count=40
subsets
	test: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
	train: # of items=13, # of annotated items=13, # of annotations=13, annotation types=['label']
	val: # of items=7, # of annotated items=7, # of annotations=7, annotation types=['label']
infos
	categories
	label: ['cat', 'dog']
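
The four steps above boil down to reassigning only the items of one subset and leaving every other item alone. The following is a rough sketch of that logic in plain Python, independent of Datumaro. It assumes a dataset can be reduced to a dict from item id to subset name; the function `split_subset`, its parameters, and its rounding rule are hypothetical and may differ from what Datumaro's `random_split` does internally.

```python
import random


def split_subset(items, subset, splits, seed=None):
    """Reassign only the items currently in `subset`; leave all others as-is.

    items:  dict mapping item id -> subset name (assumed representation)
    splits: list of (name, ratio) pairs whose ratios sum to 1
    """
    rng = random.Random(seed)
    # Sort first so the shuffle is reproducible for a given seed.
    selected = sorted(i for i, s in items.items() if s == subset)
    rng.shuffle(selected)

    result = dict(items)
    start = 0
    for index, (name, ratio) in enumerate(splits):
        # The last split takes the remainder so rounding never drops an item.
        if index == len(splits) - 1:
            end = len(selected)
        else:
            end = start + round(len(selected) * ratio)
        for item_id in selected[start:end]:
            result[item_id] = name
        start = end
    return result


# Same ids as the synthetic dataset above: 20 "train" items and 20 "test" items.
items = {f"{subset}_{idx}": subset for idx in range(20) for subset in ["train", "test"]}
new_items = split_subset(items, "train", [("train", 0.67), ("val", 0.33)], seed=0)

counts = {}
for name in new_items.values():
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

For the 20 + 20 items used here, the counts come out as 13 for train, 7 for val, and 20 for test, and every `test_*` id keeps its original subset. Which ids land in `val` depends on the seed.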

@vinnamkim added the "user experience" label on Feb 29, 2024

@CourchesneA
Author

That's exactly what I was looking for, thanks for the detailed example!
