KeyError: dataset has no key "image" #6069

etetteh · 2023-07-25T17:45:50Z

Describe the bug

I've loaded a local image dataset with:
ds = load_dataset("imagefolder", data_dir="path-to-data")

And defined a transform to process the data, following the Datasets docs.

However, I get a KeyError indicating that there is no "image" key in my dataset. When I printed the example_batch passed to the transform function, it contained only the labels.
For some reason, the images are not in the example batches.

Steps to reproduce the bug

I'm using the latest stable version of datasets

Expected behavior

I expect the example_batches to contain both images and labels

Environment info

I'm using the latest stable version of datasets


mariosasko · 2023-07-26T13:07:28Z

You can list the dataset's columns with ds.column_names before .map to check whether the dataset has an image column. If it doesn't, then this is a bug. Otherwise, please paste the line with the .map call.

etetteh · 2023-07-26T14:48:26Z

This is the piece of code I am running:

data_transforms = utils.get_data_augmentation(args)
image_dataset = utils.load_image_dataset(args.dataset)

def resize(examples):
    examples["pixel_values"] = [image.convert("RGB").resize((300, 300)) for image in examples["image"]]
    return examples

def preprocess_train(example_batch):
    print(f"Example batch: \n{example_batch}")
    example_batch["pixel_values"] = [
        data_transforms["train"](image.convert("RGB")) for image in example_batch["pixel_values"]
    ]
    return example_batch

def preprocess_val(example_batch):
    example_batch["pixel_values"] = [
        data_transforms["val"](image.convert("RGB")) for image in example_batch["pixel_values"]
    ]
    return example_batch

image_dataset = image_dataset.map(resize, remove_columns=["image"], batched=True)

image_dataset["train"].set_transform(preprocess_train)
image_dataset["validation"].set_transform(preprocess_val)

When I print ds.column_names, I get the following:
{'train': ['image', 'label'], 'validation': ['image', 'label'], 'test': ['image', 'label']}

The print(f"Example batch: \n{example_batch}") in the preprocess_train function outputs only labels without images:

Example batch: 
{'label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]}

The weird part is that the same code runs in a JupyterLab notebook without any errors, but fails when I run my scripts from the terminal.

mariosasko · 2023-07-26T15:00:23Z

The remove_columns=["image"] argument in the .map call removes the image column from the output, so drop this argument to preserve it.

etetteh · 2023-07-26T15:18:51Z

The problem is not the removal of the image key. The bug is that only the labels are sent to be processed, instead of all the features (dictionary keys).

P.S. I just dropped the remove_columns argument as you suggested, but that didn't solve the problem: still only the labels are sent to be processed.

mariosasko · 2023-07-26T17:33:49Z

All the columns in image_dataset.column_names after the map call should also be present in preprocess_train / preprocess_val, unless the (input) columns are specified in set_transform.

If that's not the case, we need a full reproducer (not snippets) with the environment info.

etetteh · 2023-07-27T12:42:02Z

I have resolved the error after including a collate function, as shown in the Quickstart section of the Datasets docs.

Here is what I did:

import torch
from torch.utils import data

data_transforms = utils.get_data_augmentation(args)
image_dataset = utils.load_image_dataset(args.dataset)

def preprocess_train(example_batch):
    example_batch["pixel_values"] = [
        data_transforms["train"](image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def preprocess_val(example_batch):
    example_batch["pixel_values"] = [
        data_transforms["val"](image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def collate_fn(examples):
    images = []
    labels = []
    for example in examples:
        images.append(example["pixel_values"])
        labels.append(example["label"])

    pixel_values = torch.stack(images)
    labels = torch.tensor(labels)
    return {"pixel_values": pixel_values, "label": labels}

train_dataset = image_dataset["train"].with_transform(preprocess_train)
val_dataset = image_dataset["validation"].with_transform(preprocess_val)

image_datasets = {
    "train": train_dataset,
    "val": val_dataset
}

samplers = {
    "train": data.RandomSampler(train_dataset),
    "val": data.SequentialSampler(val_dataset),
}

dataloaders = {
    x: data.DataLoader(
        image_datasets[x],
        collate_fn=collate_fn,
        batch_size=batch_size,
        sampler=samplers[x],
        num_workers=args.num_workers,
        worker_init_fn=utils.set_seed_for_worker,
        generator=g,
        pin_memory=True,
    )
    for x in ["train", "val"]
}

train_loader, val_loader = dataloaders["train"], dataloaders["val"]

Everything runs fine without any bug now.

sralvins · 2024-09-06T08:16:15Z

Are you using the HF Trainer? The Trainer removes columns that are not used in model.forward. Setting remove_unused_columns=False might work.

etetteh closed this as completed Jul 27, 2023
