
Commit

Merge branch 'main' into rm-deprecated
albertvillanova committed Aug 21, 2024
2 parents f91ca55 + 9ddea80 commit 5c5e1bd
Showing 70 changed files with 1,729 additions and 1,586 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/ci.yml
@@ -96,3 +96,30 @@ jobs:
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
test_py310_numpy2:
needs: check_code_quality
strategy:
matrix:
test: ['unit']
os: [ubuntu-latest, windows-latest]
deps_versions: [deps-latest]
continue-on-error: false
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install uv
run: pip install --upgrade uv
- name: Install dependencies
run: uv pip install --system "datasets[tests_numpy2] @ ."
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
2 changes: 0 additions & 2 deletions docs/source/_toctree.yml
@@ -119,6 +119,4 @@
title: Table Classes
- local: package_reference/utilities
title: Utilities
- local: package_reference/task_templates
title: Task templates
title: "Reference"
10 changes: 3 additions & 7 deletions docs/source/about_mapstyle_vs_iterable.mdx
@@ -166,7 +166,7 @@ It provides even faster data loading when iterating using a `for` loop by iterat
However as soon as your [`Dataset`] has an indices mapping (via [`Dataset.shuffle`] for example), the speed can become 10x slower.
This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore.
To restore the speed, you'd need to rewrite the entire dataset on your disk again using [`Dataset.flatten_indices`], which removes the indices mapping.
This may take a lot of time depending of the size of your dataset though:
This may take a lot of time depending on the size of your dataset though:

```python
my_dataset[0] # fast
@@ -205,19 +205,15 @@ for epoch in range(n_epochs):
pass
```
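
As a rough sketch of the trade-off described above (the dataset and the `seed` value are illustrative):

```python
my_dataset = my_dataset.shuffle(seed=42)   # creates an indices mapping: random access becomes slower
my_dataset = my_dataset.flatten_indices()  # rewrites the dataset on disk and removes the mapping
my_dataset[0]                              # fast again: reads are contiguous
```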

## Checkpoint and resuming differences

If your training loop stops, you may want to restart the training from where it was. To do so you can save a checkpoint of your model and optimizers, as well as your data loader.

To restart the iteration of a map-style dataset, you can simply skip the first examples:

```python
my_dataset = my_dataset.select(range(start_index, len(my_dataset)))
```

But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have write a custom sampler that allows resuming).
But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have written a custom sampler that allows resuming).

On the other hand, iterable datasets don't provide random access to a specific example inde to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:

```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
23 changes: 15 additions & 8 deletions docs/source/create_dataset.mdx
@@ -7,6 +7,19 @@ In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for cre
* Folder-based builders for quickly creating an image or audio dataset
* `from_` methods for creating datasets from local files

## File-based builders

🤗 Datasets supports many common formats such as `csv`, `json/jsonl`, `parquet`, `txt`.

For example, it can read a dataset made up of one or several CSV files (in this case, pass your CSV files as a list):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```
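
To load a dataset made of several CSV files, pass them as a list (the file names below are hypothetical):

```py
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv"])
```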

To get the full list of supported formats and code examples, check out [this guide](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

## Folder-based builders

There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:
@@ -61,9 +74,9 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate
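
For a quick illustration, a folder-based builder is loaded by name, pointing `data_dir` at your folder (the path below is hypothetical):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```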

To learn more about each of these folder-based builders, check out the <a href="https://huggingface.co/docs/datasets/image_dataset#imagefolder"><span class="underline decoration-yellow-400 decoration-2 font-semibold">ImageFolder</span></a> or <a href="https://huggingface.co/docs/datasets/audio_dataset#audiofolder"><span class="underline decoration-pink-400 decoration-2 font-semibold">AudioFolder</span></a> guides.

## From local files
## From Python dictionaries

You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:

* The [`~Dataset.from_generator`] method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) due to a generator's iterative behavior (a minimal sketch appears at the end of this section). This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

@@ -103,10 +116,4 @@ You can also create a dataset from local files by specifying the path to the dat
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
```
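
As a minimal sketch of the `from_generator` approach mentioned above (the toy generator and its column names are purely illustrative):

```py
>>> from datasets import Dataset
>>> def gen():
...     yield {"text": "Good movie.", "label": 1}
...     yield {"text": "Bad movie.", "label": 0}
...
>>> dataset = Dataset.from_generator(gen)
```
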
## Next steps
We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, and is not well supported on Hugging Face, though in some rare cases it can still be helpful.
To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
2 changes: 1 addition & 1 deletion docs/source/dataset_card.mdx
@@ -4,7 +4,7 @@ Each dataset should have a dataset card to promote responsible usage and inform
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in a just a few steps:
Creating a dataset card is easy and can be done in just a few steps:

1. Go to your dataset repository on the [Hub](https://hf.co/new-dataset) and click on **Create Dataset Card** to create a new `README.md` file in your repository.

Expand Down
2 changes: 1 addition & 1 deletion docs/source/faiss_es.mdx
@@ -1,6 +1,6 @@
# Search index

[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on a Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on an Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.

This guide will show you how to build an index for your dataset that will allow you to search it.

2 changes: 1 addition & 1 deletion docs/source/image_dataset.mdx
@@ -345,7 +345,7 @@ def _info(self):
homepage=_HOMEPAGE,
citation=_CITATION,
license=_LICENSE,
task_templates=[ImageClassification(image_column="image", label_column="label")],

)
```

2 changes: 1 addition & 1 deletion docs/source/load_hub.mdx
@@ -80,7 +80,7 @@ DatasetDict({

## Configurations

Some datasets contain several sub-datasets. For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as *configurations*, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🤗 Datasets will raise a `ValueError` and remind you to choose a configuration.
Some datasets contain several sub-datasets. For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as *configurations* or *subsets*, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🤗 Datasets will raise a `ValueError` and remind you to choose a configuration.

Use the [`get_dataset_config_names`] function to retrieve a list of all the possible configurations available to your dataset:
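
For example, a minimal sketch of that call for the MInDS-14 dataset mentioned above:

```py
>>> from datasets import get_dataset_config_names
>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)  # one configuration per language, e.g. 'en-US', 'fr-FR', ...
```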

21 changes: 17 additions & 4 deletions docs/source/package_reference/main_classes.mdx
@@ -97,7 +97,6 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
- from_parquet
- from_text
- from_sql
- prepare_for_task
- align_labels_with_mapping

[[autodoc]] datasets.concatenate_datasets
@@ -150,7 +149,6 @@ It also has dataset transform methods like map or filter, to process all the spl
- from_json
- from_parquet
- from_text
- prepare_for_task

<a id='package_reference_features'></a>

@@ -170,6 +168,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
- rename_column
- filter
- shuffle
- batch
- skip
- take
- load_state_dict
@@ -210,16 +209,26 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Features

[[autodoc]] datasets.Sequence
### Scalar

[[autodoc]] datasets.Value

[[autodoc]] datasets.ClassLabel

[[autodoc]] datasets.Value
### Composite

[[autodoc]] datasets.LargeList

[[autodoc]] datasets.Sequence

### Translation

[[autodoc]] datasets.Translation

[[autodoc]] datasets.TranslationVariableLanguages

### Arrays

[[autodoc]] datasets.Array2D

[[autodoc]] datasets.Array3D
@@ -228,8 +237,12 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Array5D

### Audio

[[autodoc]] datasets.Audio

### Image

[[autodoc]] datasets.Image

## Filesystems
25 changes: 0 additions & 25 deletions docs/source/package_reference/task_templates.mdx

This file was deleted.

26 changes: 26 additions & 0 deletions docs/source/process.mdx
@@ -546,6 +546,32 @@ The following example shows how you can use `torch.distributed.barrier` to synch
... torch.distributed.barrier()
```

## Batch

The [`~Dataset.batch`] method allows you to group samples from the dataset into batches. This is particularly useful when you want to create batches of data for training or evaluation, especially when working with deep learning models.

Here's an example of how to use the `batch()` method:

```python
>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> batched_dataset = dataset.batch(batch_size=4)
>>> batched_dataset[0]
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic',
'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'],
'label': [1, 1, 1, 1]}
```

The `batch()` method accepts the following parameters:

- `batch_size` (`int`): The number of samples in each batch.
- `drop_last_batch` (`bool`, defaults to `False`): Whether to drop the last incomplete batch if the dataset size is not divisible by the batch size.
- `num_proc` (`int`, optional, defaults to `None`): The number of processes to use for multiprocessing. If None, no multiprocessing is used. This can significantly speed up batching for large datasets.

Note that `Dataset.batch()` returns a new [`Dataset`] where each item is a batch of multiple samples from the original dataset. If you want to process data in batches, use a batched [`~Dataset.map`] directly instead, which applies a function to batches while keeping the output dataset unbatched.
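
For illustration, a minimal sketch combining these parameters (the values are arbitrary):

```python
>>> # Illustrative values: drop the trailing incomplete batch and batch with 4 processes
>>> batched_dataset = dataset.batch(batch_size=16, drop_last_batch=True, num_proc=4)
>>> len(batched_dataset[0]["text"])
16
```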

## Concatenate

Separate datasets can be concatenated if they share the same column types. Concatenate datasets with [`concatenate_datasets`]:
38 changes: 38 additions & 0 deletions docs/source/stream.mdx
@@ -318,6 +318,44 @@ You can filter rows in the dataset based on a predicate function using [`Dataset
{'id': 4, 'text': 'Are you looking for Number the Stars (Essential Modern Classics)? Normally, ...'}]
```

## Batch

The `batch` method transforms your `IterableDataset` into an iterable of batches. This is particularly useful when you want to work with batches in your training loop or when using frameworks that expect batched inputs.

<Tip>

There is also a "Batch Processing" option when using the `map` function to apply a function to batches of data, which is discussed in the [Map section](#map) above. The `batch` method described here is different and provides a more direct way to create batches from your dataset.

</Tip>

You can use the `batch` method like this:

```python
from datasets import load_dataset

# Load a dataset in streaming mode
dataset = load_dataset("some_dataset", split="train", streaming=True)

# Create batches of 32 samples
batched_dataset = dataset.batch(batch_size=32)

# Iterate over the batched dataset
for batch in batched_dataset:
print(batch)
break
```

In this example, `batched_dataset` is still an `IterableDataset`, but each item yielded is now a batch of 32 samples instead of a single sample.
This batching is done on the fly as you iterate over the dataset, preserving the memory-efficient nature of `IterableDataset`.

The `batch` method also provides a `drop_last_batch` parameter.
When set to `True`, it will discard the last batch if it's smaller than the specified `batch_size`.
This can be useful in scenarios where your downstream processing requires all batches to be of the same size:

```python
batched_dataset = dataset.batch(batch_size=32, drop_last_batch=True)
```

## Stream in a training loop

[`IterableDataset`] can be integrated into a training loop. First, shuffle the dataset:
4 changes: 2 additions & 2 deletions docs/source/use_with_jax.mdx
@@ -77,7 +77,7 @@ True
Note that if the `device` argument is not provided to `with_format` then it will use the default
device which is `jax.devices()[0]`.
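
For example, a minimal sketch that passes the default device explicitly (the toy dataset is illustrative, and the exact output depends on your JAX version and hardware):

```py
>>> import jax
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"data": [[1, 2], [3, 4]]})
>>> ds = ds.with_format("jax", device=jax.devices()[0])  # same device as the default
>>> ds[0]
{'data': Array([1, 2], dtype=int32)}
```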

## N-dimensional arrays
### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same tensor if the shape is fixed:

@@ -120,7 +120,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
[7, 8]]], dtype=int32)}
```

## Other feature types
### Other feature types

[`ClassLabel`] data is properly converted to arrays:

6 changes: 4 additions & 2 deletions docs/source/use_with_pytorch.mdx
@@ -38,7 +38,7 @@ To load the data as tensors on a GPU, specify the `device` argument:
{'data': tensor([1, 2], device='cuda:0')}
```

## N-dimensional arrays
### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same tensor if the shape is fixed:

@@ -82,7 +82,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
```


## Other feature types
### Other feature types

[`ClassLabel`] data are properly converted to tensors:

@@ -223,6 +223,8 @@ If the dataset is split in several shards (i.e. if the dataset consists of multi

In this case each worker is given a subset of the list of shards to stream from.
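
As a rough sketch (the dataset name and worker count are illustrative), streaming with multiple workers just means passing `num_workers` to the `DataLoader`:

```py
>>> from torch.utils.data import DataLoader
>>> # my_iterable_dataset is assumed to stream from several shards (e.g. created with num_shards=8)
>>> dataloader = DataLoader(my_iterable_dataset, batch_size=32, num_workers=4)
```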

### Checkpoint and resume

If you need a DataLoader that you can checkpoint and resume in the middle of training, you can use the `StatefulDataLoader` from [torchdata](https://github.com/pytorch/data):

```py
4 changes: 2 additions & 2 deletions docs/source/use_with_tensorflow.mdx
@@ -39,7 +39,7 @@ array([[1, 2],
[3, 4]])>}
```

## N-dimensional arrays
### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same tensor if the shape is fixed:

@@ -88,7 +88,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
```


## Other feature types
### Other feature types

[`ClassLabel`] data are properly converted to tensors:
