
Commit

Merge branch 'main' into rm-deprecated
albertvillanova committed Aug 21, 2024
2 parents f91ca55 + 9ddea80 commit 5c5e1bd
Showing 70 changed files with 1,729 additions and 1,586 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/ci.yml
@@ -96,3 +96,30 @@ jobs:
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
test_py310_numpy2:
needs: check_code_quality
strategy:
matrix:
test: ['unit']
os: [ubuntu-latest, windows-latest]
deps_versions: [deps-latest]
continue-on-error: false
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install uv
run: pip install --upgrade uv
- name: Install dependencies
run: uv pip install --system "datasets[tests_numpy2] @ ."
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
2 changes: 0 additions & 2 deletions docs/source/_toctree.yml
@@ -119,6 +119,4 @@
title: Table Classes
- local: package_reference/utilities
title: Utilities
- local: package_reference/task_templates
title: Task templates
title: "Reference"
10 changes: 3 additions & 7 deletions docs/source/about_mapstyle_vs_iterable.mdx
@@ -166,7 +166,7 @@ It provides even faster data loading when iterating using a `for` loop by iterat
However as soon as your [`Dataset`] has an indices mapping (via [`Dataset.shuffle`] for example), the speed can become 10x slower.
This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore.
To restore the speed, you'd need to rewrite the entire dataset on your disk again using [`Dataset.flatten_indices`], which removes the indices mapping.
This may take a lot of time depending of the size of your dataset though:
This may take a lot of time depending on the size of your dataset though:

```python
my_dataset[0] # fast
@@ -205,19 +205,15 @@ for epoch in range(n_epochs):
pass
```
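
As a rough sketch of the trade-off described above (the dataset and the `seed` value are illustrative):

```python
my_dataset = my_dataset.shuffle(seed=42)   # creates an indices mapping: random access becomes slower
my_dataset = my_dataset.flatten_indices()  # rewrites the dataset on disk and removes the mapping
my_dataset[0]                              # fast again: reads are contiguous
```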

## Checkpoint and resuming differences

If your training loop stops, you may want to restart the training from where it was. To do so you can save a checkpoint of your model and optimizers, as well as your data loader.

To restart the iteration of a map-style dataset, you can simply skip the first examples:

```python
my_dataset = my_dataset.select(range(start_index, len(my_dataset)))
```

But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have write a custom sampler that allows resuming).
But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have written a custom sampler that allows resuming).

On the other hand, iterable datasets don't provide random access to a specific example inde to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:

```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
23 changes: 15 additions & 8 deletions docs/source/create_dataset.mdx
@@ -7,6 +7,19 @@ In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for cre
* Folder-based builders for quickly creating an image or audio dataset
* `from_` methods for creating datasets from local files

## File-based builders

🤗 Datasets supports many common formats such as `csv`, `json/jsonl`, `parquet`, `txt`.

For example, it can read a dataset made up of one or several CSV files (in this case, pass your CSV files as a list):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```
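
To load a dataset made of several CSV files, pass them as a list (the file names below are hypothetical):

```py
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv"])
```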

To get the full list of supported formats and code examples, check out [this guide](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

## Folder-based builders

There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:
@@ -61,9 +74,9 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate
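
For a quick illustration, a folder-based builder is loaded by name, pointing `data_dir` at your folder (the path below is hypothetical):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```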

To learn more about each of these folder-based builders, check out the <a href="https://huggingface.co/docs/datasets/image_dataset#imagefolder"><span class="underline decoration-yellow-400 decoration-2 font-semibold">ImageFolder</span></a> or <a href="https://huggingface.co/docs/datasets/audio_dataset#audiofolder"><span class="underline decoration-pink-400 decoration-2 font-semibold">AudioFolder</span></a> guides.

## From local files
## From Python dictionaries

You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:

* The [`~Dataset.from_generator`] method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) due to a generator's iterative behavior (a minimal sketch appears at the end of this section). This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

@@ -103,10 +116,4 @@ You can also create a dataset from local files by specifying the path to the dat
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
```
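
As a minimal sketch of the `from_generator` approach mentioned above (the toy generator and its column names are purely illustrative):

```py
>>> from datasets import Dataset
>>> def gen():
...     yield {"text": "Good movie.", "label": 1}
...     yield {"text": "Bad movie.", "label": 0}
...
>>> dataset = Dataset.from_generator(gen)
```
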
## Next steps
We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, and is not well supported on Hugging Face, though in some rare cases it can still be helpful.
To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
2 changes: 1 addition & 1 deletion docs/source/dataset_card.mdx
@@ -4,7 +4,7 @@ Each dataset should have a dataset card to promote responsible usage and inform
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in a just a few steps:
Creating a dataset card is easy and can be done in just a few steps:

1. Go to your dataset repository on the [Hub](https://hf.co/new-dataset) and click on **Create Dataset Card** to create a new `README.md` file in your repository.

Expand Down
2 changes: 1 addition & 1 deletion docs/source/faiss_es.mdx
@@ -1,6 +1,6 @@
# Search index

[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on a Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on an Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.

This guide will show you how to build an index for your dataset that will allow you to search it.

2 changes: 1 addition & 1 deletion docs/source/image_dataset.mdx
@@ -345,7 +345,7 @@ def _info(self):
homepage=_HOMEPAGE,
citation=_CITATION,
license=_LICENSE,
task_templates=[ImageClassification(image_column="image", label_column="label")],

)
```

2 changes: 1 addition & 1 deletion docs/source/load_hub.mdx
@@ -80,7 +80,7 @@ DatasetDict({

## Configurations

Some datasets contain several sub-datasets. For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as *configurations*, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🤗 Datasets will raise a `ValueError` and remind you to choose a configuration.
Some datasets contain several sub-datasets. For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as *configurations* or *subsets*, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🤗 Datasets will raise a `ValueError` and remind you to choose a configuration.

Use the [`get_dataset_config_names`] function to retrieve a list of all the possible configurations available to your dataset:
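
For example, a minimal sketch of that call for the MInDS-14 dataset mentioned above:

```py
>>> from datasets import get_dataset_config_names
>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)  # one configuration per language, e.g. 'en-US', 'fr-FR', ...
```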

21 changes: 17 additions & 4 deletions docs/source/package_reference/main_classes.mdx
@@ -97,7 +97,6 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
- from_parquet
- from_text
- from_sql
- prepare_for_task
- align_labels_with_mapping

[[autodoc]] datasets.concatenate_datasets
@@ -150,7 +149,6 @@ It also has dataset transform methods like map or filter, to process all the spl
- from_json
- from_parquet
- from_text
- prepare_for_task

<a id='package_reference_features'></a>

@@ -170,6 +168,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
- rename_column
- filter
- shuffle
- batch
- skip
- take
- load_state_dict
@@ -210,16 +209,26 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Features

[[autodoc]] datasets.Sequence
### Scalar

[[autodoc]] datasets.Value

[[autodoc]] datasets.ClassLabel

[[autodoc]] datasets.Value
### Composite

[[autodoc]] datasets.LargeList

[[autodoc]] datasets.Sequence

### Translation

[[autodoc]] datasets.Translation

[[autodoc]] datasets.TranslationVariableLanguages

### Arrays

[[autodoc]] datasets.Array2D

[[autodoc]] datasets.Array3D
@@ -228,8 +237,12 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Array5D

### Audio

[[autodoc]] datasets.Audio

### Image

[[autodoc]] datasets.Image

## Filesystems
25 changes: 0 additions & 25 deletions docs/source/package_reference/task_templates.mdx

This file was deleted.

26 changes: 26 additions & 0 deletions docs/source/process.mdx
@@ -546,6 +546,32 @@ The following example shows how you can use `torch.distributed.barrier` to synch
... torch.distributed.barrier()
```

## Batch

The [`~Dataset.batch`] method allows you to group samples from the dataset into batches. This is particularly useful when you want to create batches of data for training or evaluation, especially when working with deep learning models.

Here's an example of how to use the `batch()` method:

```python
>>> from datasets import load_dataset
>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> batched_dataset = dataset.batch(batch_size=4)
>>> batched_dataset[0]
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic',
'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'],
'label': [1, 1, 1, 1]}
```

The `batch()` method accepts the following parameters:

- `batch_size` (`int`): The number of samples in each batch.
- `drop_last_batch` (`bool`, defaults to `False`): Whether to drop the last incomplete batch if the dataset size is not divisible by the batch size.
- `num_proc` (`int`, optional, defaults to `None`): The number of processes to use for multiprocessing. If None, no multiprocessing is used. This can significantly speed up batching for large datasets.

Note that `Dataset.batch()` returns a new [`Dataset`] where each item is a batch of multiple samples from the original dataset. If you want to process data in batches, use a batched [`~Dataset.map`] directly instead, which applies a function to batches while keeping the output dataset unbatched.
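
For illustration, a minimal sketch combining these parameters (the values are arbitrary):

```python
>>> # Illustrative values: drop the trailing incomplete batch and batch with 4 processes
>>> batched_dataset = dataset.batch(batch_size=16, drop_last_batch=True, num_proc=4)
>>> len(batched_dataset[0]["text"])
16
```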

## Concatenate

Separate datasets can be concatenated if they share the same column types. Concatenate datasets with [`concatenate_datasets`]:
38 changes: 38 additions & 0 deletions docs/source/stream.mdx
@@ -318,6 +318,44 @@ You can filter rows in the dataset based on a predicate function using [`Dataset
{'id': 4, 'text': 'Are you looking for Number the Stars (Essential Modern Classics)? Normally, ...'}]
```

## Batch

The `batch` method transforms your `IterableDataset` into an iterable of batches. This is particularly useful when you want to work with batches in your training loop or when using frameworks that expect batched inputs.

<Tip>

There is also a "Batch Processing" option when using the `map` function to apply a function to batches of data, which is discussed in the [Map section](#map) above. The `batch` method described here is different and provides a more direct way to create batches from your dataset.

</Tip>

You can use the `batch` method like this:

```python
from datasets import load_dataset

# Load a dataset in streaming mode
dataset = load_dataset("some_dataset", split="train", streaming=True)

# Create batches of 32 samples
batched_dataset = dataset.batch(batch_size=32)

# Iterate over the batched dataset
for batch in batched_dataset:
print(batch)
break
```

In this example, `batched_dataset` is still an `IterableDataset`, but each item yielded is now a batch of 32 samples instead of a single sample.
This batching is done on the fly as you iterate over the dataset, preserving the memory-efficient nature of `IterableDataset`.

The `batch` method also provides a `drop_last_batch` parameter.
When set to `True`, it will discard the last batch if it's smaller than the specified `batch_size`.
This can be useful in scenarios where your downstream processing requires all batches to be of the same size:

```python
batched_dataset = dataset.batch(batch_size=32, drop_last_batch=True)
```

## Stream in a training loop

[`IterableDataset`] can be integrated into a training loop. First, shuffle the dataset:
4 changes: 2 additions & 2 deletions docs/source/use_with_jax.mdx
@@ -77,7 +77,7 @@ True
Note that if the `device` argument is not provided to `with_format` then it will use the default
device which is `jax.devices()[0]`.
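
For example, a minimal sketch that passes the default device explicitly (the toy dataset is illustrative, and the exact output depends on your JAX version and hardware):

```py
>>> import jax
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"data": [[1, 2], [3, 4]]})
>>> ds = ds.with_format("jax", device=jax.devices()[0])  # same device as the default
>>> ds[0]
{'data': Array([1, 2], dtype=int32)}
```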

## N-dimensional arrays
### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same tensor if the shape is fixed:

@@ -120,7 +120,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
[7, 8]]], dtype=int32)}
```

## Other feature types
### Other feature types

[`ClassLabel`] data is properly converted to arrays:

6 changes: 4 additions & 2 deletions docs/source/use_with_pytorch.mdx
@@ -38,7 +38,7 @@ To load the data as tensors on a GPU, specify the `device` argument:
{'data': tensor([1, 2], device='cuda:0')}
```

## N-dimensional arrays
### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same tensor if the shape is fixed:

@@ -82,7 +82,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
```


## Other feature types
### Other feature types

[`ClassLabel`] data are properly converted to tensors:

@@ -223,6 +223,8 @@ If the dataset is split in several shards (i.e. if the dataset consists of multi

In this case each worker is given a subset of the list of shards to stream from.
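
As a rough sketch (the dataset name and worker count are illustrative), streaming with multiple workers just means passing `num_workers` to the `DataLoader`:

```py
>>> from torch.utils.data import DataLoader
>>> # my_iterable_dataset is assumed to stream from several shards (e.g. created with num_shards=8)
>>> dataloader = DataLoader(my_iterable_dataset, batch_size=32, num_workers=4)
```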

### Checkpoint and resume

If you need a DataLoader that you can checkpoint and resume in the middle of training, you can use the `StatefulDataLoader` from [torchdata](https://github.com/pytorch/data):

```py
4 changes: 2 additions & 2 deletions docs/source/use_with_tensorflow.mdx
@@ -39,7 +39,7 @@ array([[1, 2],
[3, 4]])>}
```

## N-dimensional arrays
### N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same tensor if the shape is fixed:

@@ -88,7 +88,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th
```


## Other feature types
### Other feature types

[`ClassLabel`] data are properly converted to tensors:
