KeyError: dataset has no key "image" #6069

Closed
etetteh opened this issue Jul 25, 2023 · 7 comments

Comments

@etetteh

etetteh commented Jul 25, 2023

Describe the bug

I've loaded a local image dataset with:
ds = load_dataset("imagefolder", data_dir=path-to-data)

And defined a transform to process the data, following the Datasets docs.

However, I get a KeyError indicating there's no "image" key in my dataset. When I print the example_batch passed to the transform function, it contains only the labels.
For some reason, the images are missing from the example batches.

Steps to reproduce the bug

I'm using the latest stable version of datasets

Expected behavior

I expect the example_batches to contain both images and labels

Environment info

I'm using the latest stable version of datasets

@mariosasko
Collaborator

You can list the dataset's columns with ds.column_names before .map to check whether the dataset has an image column. If it doesn't, then this is a bug. Otherwise, please paste the line with the .map call.

@etetteh
Author

etetteh commented Jul 26, 2023

This is the piece of code I am running:

data_transforms = utils.get_data_augmentation(args)
image_dataset = utils.load_image_dataset(args.dataset)

def resize(examples):
    examples["pixel_values"] = [image.convert("RGB").resize((300, 300)) for image in examples["image"]]
    return examples

def preprocess_train(example_batch):
    print(f"Example batch: \n{example_batch}")
    example_batch["pixel_values"] = [
        data_transforms["train"](image.convert("RGB")) for image in example_batch["pixel_values"]
    ]
    return example_batch

def preprocess_val(example_batch):
    example_batch["pixel_values"] = [
        data_transforms["val"](image.convert("RGB")) for image in example_batch["pixel_values"]
    ]
    return example_batch

image_dataset = image_dataset.map(resize, remove_columns=["image"], batched=True)

image_dataset["train"].set_transform(preprocess_train)
image_dataset["validation"].set_transform(preprocess_val)

When I print ds.column_names, I get the following:
{'train': ['image', 'label'], 'validation': ['image', 'label'], 'test': ['image', 'label']}

The print(f"Example batch: \n{example_batch}") in the preprocess_train function outputs only labels without images:

Example batch: 
{'label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]}

The weird part is that the same code runs without errors in a JupyterLab notebook, but fails with this error when I run my scripts from the terminal.

@mariosasko
Collaborator

The remove_columns=["image"] argument in the .map call removes the image column from the output, so drop this argument to preserve it.
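As a rough illustration of that point, here is a toy model of a batched map with remove_columns. It is only a sketch of the documented behavior (removed columns are dropped from the input before the function's output is merged in). It is not the datasets implementation, and toy_batched_map and the string stand-ins for images are hypothetical.

```python
def toy_batched_map(batch, fn, remove_columns=()):
    """Toy stand-in for a batched map over one batch (NOT the datasets implementation)."""
    # Columns listed in remove_columns are dropped from the input columns that get kept...
    kept = {name: values for name, values in batch.items() if name not in remove_columns}
    # ...and whatever the function returns is merged on top of the remaining columns.
    return {**kept, **fn(batch)}


def resize(examples):
    # Stand-in for the image resizing in the thread: derive a new column from "image".
    return {"pixel_values": [f"resized:{image}" for image in examples["image"]]}


batch = {"image": ["a.png", "b.png"], "label": [0, 1]}
mapped = toy_batched_map(batch, resize, remove_columns=["image"])
columns = sorted(mapped)
print(columns)  # ['label', 'pixel_values']
```

In this sketch, any later transform that reads example_batch["image"] from the mapped data would raise a KeyError, because only "label" and "pixel_values" remain.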

@etetteh
Author

etetteh commented Jul 26, 2023

The problem is not the removal of the image key. The bug is that only the labels are sent to be processed, instead of all the features (dictionary keys).

P.S. I just dropped the remove_columns argument as you suggested, but that didn't solve the problem: only the labels are still being sent to be processed.

@mariosasko
Collaborator

All the columns in image_dataset.column_names after the map call should also be present in preprocess_train/preprocess_val, unless (input) columns are specified in set_transform.

If that's not the case, we need a full reproducer (not snippets) with the environment info.
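To make the "unless columns are specified" caveat concrete, here is a toy model of an on-the-fly transform with an optional column restriction. It is a sketch under the assumption that restricting columns limits what the transform receives. toy_apply_transform is a hypothetical helper, not part of datasets, and the strings stand in for images.

```python
def toy_apply_transform(batch, transform, columns=None):
    """Toy model of an on-the-fly transform (NOT the datasets implementation)."""
    # With no column restriction the transform sees every column;
    # otherwise it only sees the columns that were explicitly requested.
    visible = dict(batch) if columns is None else {name: batch[name] for name in columns}
    return transform(visible)


def preprocess(example_batch):
    # Mirrors the shape of preprocess_train in the thread, with strings instead of images.
    example_batch["pixel_values"] = [f"tensor({image})" for image in example_batch["image"]]
    return example_batch


batch = {"image": ["a.png", "b.png"], "label": [0, 1]}

# No restriction: "image" is available, so the transform succeeds.
full = toy_apply_transform(batch, preprocess)

# Restricted to "label": the transform fails the same way the issue describes.
try:
    toy_apply_transform(batch, preprocess, columns=["label"])
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

print(sorted(full))  # ['image', 'label', 'pixel_values']
print(missing_key)   # image
```

If a batch arrives with fewer columns than column_names reports, something between the dataset and the transform is narrowing the input, which is why a full reproducer helps.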

@etetteh
Author

etetteh commented Jul 27, 2023

I resolved the error by adding a collate function, as shown in the Quickstart section of the Datasets docs.

Here is what I did:

data_transforms = utils.get_data_augmentation(args)
image_dataset = utils.load_image_dataset(args.dataset)

def preprocess_train(example_batch):
    example_batch["pixel_values"] = [
        data_transforms["train"](image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def preprocess_val(example_batch):
    example_batch["pixel_values"] = [
        data_transforms["val"](image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def collate_fn(examples):
    images = []
    labels = []
    for example in examples:
        images.append(example["pixel_values"])
        labels.append(example["label"])

    pixel_values = torch.stack(images)
    labels = torch.tensor(labels)
    return {"pixel_values": pixel_values, "label": labels}

train_dataset = image_dataset["train"].with_transform(preprocess_train)
val_dataset = image_dataset["validation"].with_transform(preprocess_val)

image_datasets = {
    "train": train_dataset,
    "val": val_dataset
}

samplers = {
    "train": data.RandomSampler(train_dataset),
    "val": data.SequentialSampler(val_dataset),
}

dataloaders = {
    x: data.DataLoader(
        image_datasets[x],
        collate_fn=collate_fn,
        batch_size=batch_size,
        sampler=samplers[x],
        num_workers=args.num_workers,
        worker_init_fn=utils.set_seed_for_worker,
        generator=g,
        pin_memory=True,
    )
    for x in ["train", "val"]
}

train_loader, val_loader = dataloaders["train"], dataloaders["val"]

Everything runs fine without any bug now.

@etetteh etetteh closed this as completed Jul 27, 2023
@sralvins

sralvins commented Sep 6, 2024

Are you using the HF Trainer? The Trainer removes columns that are not used in model.forward. Setting remove_unused_columns=False might work.
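The thread never confirms that the Trainer was involved, but if it was, this would explain why only "label" reached the transform. Below is a simplified sketch of signature-based column filtering. ToyModel and toy_used_columns are hypothetical, and the assumption that label-like names are always kept is a rough approximation of the Trainer's behavior, not its actual code.

```python
import inspect


class ToyModel:
    """Hypothetical model whose forward accepts pixel_values and labels."""

    def forward(self, pixel_values, labels=None):
        return pixel_values


def toy_used_columns(model, column_names, label_names=("label", "label_ids")):
    """Keep only columns that match forward()'s parameters or a label name (simplified sketch)."""
    accepted = set(inspect.signature(model.forward).parameters) | set(label_names)
    return [name for name in column_names if name in accepted]


# The raw dataset in the thread has "image" and "label"; forward() never takes "image".
kept = toy_used_columns(ToyModel(), ["image", "label"])
print(kept)  # ['label']
```

In this sketch "image" is filtered out before any transform runs, because "pixel_values" only exists after the transform. Passing remove_unused_columns=False in TrainingArguments disables this kind of filtering.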
