Merge branch 'main' into rm-tasks
albertvillanova committed Aug 21, 2024
2 parents 304eeb7 + 90b1d94 commit 879cfce
Showing 184 changed files with 1,873 additions and 17,048 deletions.
33 changes: 30 additions & 3 deletions .github/workflows/ci.yml
@@ -28,8 +28,8 @@ jobs:
pip install .[quality]
- name: Check quality
run: |
ruff check tests src benchmarks metrics utils setup.py # linter
ruff format --check tests src benchmarks metrics utils setup.py # formatter
ruff check tests src benchmarks utils setup.py # linter
ruff format --check tests src benchmarks utils setup.py # formatter
test:
needs: check_code_quality
@@ -56,7 +56,7 @@ jobs:
- name: Install uv
run: pip install --upgrade uv
- name: Install dependencies
run: uv pip install --system "datasets[tests,metrics-tests] @ ."
run: uv pip install --system "datasets[tests] @ ."
- name: Install dependencies (latest versions)
if: ${{ matrix.os == 'ubuntu-latest' }}
run: uv pip install --system -r additional-tests-requirements.txt --no-deps
@@ -96,3 +96,30 @@ jobs:
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
test_py310_numpy2:
needs: check_code_quality
strategy:
matrix:
test: ['unit']
os: [ubuntu-latest, windows-latest]
deps_versions: [deps-latest]
continue-on-error: false
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install uv
run: pip install --upgrade uv
- name: Install dependencies
run: uv pip install --system "datasets[tests_numpy2] @ ."
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
7 changes: 0 additions & 7 deletions .gitignore
@@ -42,13 +42,6 @@ venv.bak/
.idea
.vscode

# keep only the empty datasets and metrics directory with it's __init__.py file
/src/*/datasets/*
!/src/*/datasets/__init__.py

/src/*/metrics/*
!/src/*/metrics/__init__.py

# Vim
.*.swp

2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
.PHONY: quality style test

check_dirs := tests src benchmarks metrics utils
check_dirs := tests src benchmarks utils

# Check that source code meets quality standards

4 changes: 0 additions & 4 deletions additional-tests-requirements.txt
@@ -1,5 +1 @@
unbabel-comet>=1.0.0
git+https://github.com/pytorch/data.git
git+https://github.com/google-research/bleurt.git
git+https://github.com/ns-moosavi/coval.git
git+https://github.com/hendrycks/math.git
1 change: 0 additions & 1 deletion docs/source/_redirects.yml
@@ -8,7 +8,6 @@ splits: loading#slice-splits
processing: process
faiss_and_ea: faiss_es
features: about_dataset_features
using_metrics: how_to_metrics
exploring: access
package_reference/logging_methods: package_reference/utilities
# end of first_section
8 changes: 0 additions & 8 deletions docs/source/_toctree.yml
@@ -15,8 +15,6 @@
title: Know your dataset
- local: use_dataset
title: Preprocess
- local: metrics
title: Evaluate predictions
- local: create_dataset
title: Create a dataset
- local: upload_dataset
@@ -48,10 +46,6 @@
title: Search index
- local: cli
title: CLI
- local: how_to_metrics
title: Metrics
- local: beam
title: Beam Datasets
- local: troubleshoot
title: Troubleshooting
title: "General usage"
@@ -113,8 +107,6 @@
title: Build and load
- local: about_map_batch
title: Batch mapping
- local: about_metrics
title: All about metrics
title: "Conceptual guides"
- sections:
- local: package_reference/main_classes
10 changes: 3 additions & 7 deletions docs/source/about_mapstyle_vs_iterable.mdx
@@ -166,7 +166,7 @@ It provides even faster data loading when iterating using a `for` loop by iterat
However as soon as your [`Dataset`] has an indices mapping (via [`Dataset.shuffle`] for example), the speed can become 10x slower.
This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore.
To restore the speed, you'd need to rewrite the entire dataset on your disk again using [`Dataset.flatten_indices`], which removes the indices mapping.
This may take a lot of time depending of the size of your dataset though:
This may take a lot of time depending on the size of your dataset though:

```python
my_dataset[0] # fast
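# A minimal illustrative sketch of the pattern described above (not part of the original snippet):
my_dataset = my_dataset.shuffle(seed=42)   # adds an indices mapping
my_dataset[0]                              # slower: indirect, non-contiguous reads
my_dataset = my_dataset.flatten_indices()  # rewrites the dataset contiguously on disk
my_dataset[0]                              # fast again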
@@ -205,19 +205,15 @@ for epoch in range(n_epochs):
pass
```

## Checkpoint and resuming differences

If your training loop stops, you may want to restart training from where it left off. To do so, you can save a checkpoint of your model and optimizers, as well as your data loader.

To restart the iteration of a map-style dataset, you can simply skip the first examples:

```python
my_dataset = my_dataset.select(range(start_index, len(my_dataset)))
```

But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have write a custom sampler that allows resuming).
But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have written a custom sampler that allows resuming).
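As a minimal sketch (assuming PyTorch's `Sampler` API; the class below is illustrative and not part of 🤗 Datasets), a resumable sampler only needs to remember how far it got:

```python
from torch.utils.data import Sampler

class ResumableSequentialSampler(Sampler):
    """Illustrative sequential sampler that can save and restore its position."""

    def __init__(self, data_source, start_index=0):
        self.data_source = data_source
        self.start_index = start_index  # index to resume from

    def __iter__(self):
        for idx in range(self.start_index, len(self.data_source)):
            self._last_index = idx
            yield idx
        self.start_index = 0  # reset once a full epoch completes

    def __len__(self):
        return len(self.data_source) - self.start_index

    def state_dict(self):
        # Save this alongside your model and optimizer checkpoint.
        return {"start_index": getattr(self, "_last_index", self.start_index - 1) + 1}

    def load_state_dict(self, state_dict):
        self.start_index = state_dict["start_index"]
```

You would then checkpoint `sampler.state_dict()` together with your model and optimizer, and call `sampler.load_state_dict(...)` before rebuilding the `DataLoader`.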

On the other hand, iterable datasets don't provide random access to a specific example inde to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:

```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
25 changes: 0 additions & 25 deletions docs/source/about_metrics.mdx

This file was deleted.

4 changes: 1 addition & 3 deletions docs/source/audio_dataset.mdx
@@ -14,8 +14,6 @@ There are several methods for creating and sharing an audio dataset:

* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

* Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale audio datasets.


<Tip>

@@ -175,7 +173,7 @@ Some audio datasets, like those found in [Kaggle competitions](https://www.kaggl

</Tip>

## Loading script
## (Legacy) Loading script

Write a dataset loading script to manually create a dataset.
It defines a dataset's splits and configurations, and handles downloading and generating the dataset examples.
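As a rough sketch of the shape such a script takes (the class name, URL, and features below are placeholders, not taken from this repository):

```python
import os

import datasets


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    """Illustrative skeleton of a legacy loading script."""

    def _info(self):
        # Declare the features of the dataset.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "label": datasets.ClassLabel(names=["cat", "dog"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # Download and extract the raw data, then declare the splits.
        archive = dl_manager.download_and_extract("https://example.com/my_audio_data.tar.gz")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": archive},
            ),
        ]

    def _generate_examples(self, data_dir):
        # Yield (key, example) pairs, one per audio file.
        for idx, fname in enumerate(sorted(os.listdir(data_dir))):
            yield idx, {"audio": os.path.join(data_dir, fname), "label": "cat"}
```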
52 changes: 0 additions & 52 deletions docs/source/beam.mdx

This file was deleted.

20 changes: 0 additions & 20 deletions docs/source/cache.mdx
@@ -24,13 +24,6 @@ When you load a dataset, you also have the option to change where the data is ca
>>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
```

Similarly, you can change where a metric is cached with the `cache_dir` parameter:

```py
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
```

## Download mode

After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
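A minimal sketch of forcing a fresh download (the dataset name is a placeholder):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', download_mode='force_redownload')
```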
@@ -77,19 +70,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par

</Tip>

You can also avoid caching your metric entirely, and keep it in CPU memory instead:

```py
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', keep_in_memory=True)
```

<Tip warning={true}>

Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.

</Tip>

<a id='load_dataset_enhancing_performance'></a>

## Improve performance
3 changes: 1 addition & 2 deletions docs/source/cli.mdx
@@ -8,12 +8,11 @@ You can check the available commands:
usage: datasets-cli <command> [<args>]

positional arguments:
{convert,env,test,run_beam,dummy_data,convert_to_parquet}
{convert,env,test,dummy_data,convert_to_parquet}
datasets-cli command helpers
convert Convert a TensorFlow Datasets dataset to a HuggingFace Datasets dataset.
env Print relevant system environment info.
test Test dataset implementation.
run_beam Run a Beam dataset processing pipeline
dummy_data Generate dummy data.
convert_to_parquet Convert dataset to Parquet
delete_from_hub Delete dataset config from the Hub
23 changes: 15 additions & 8 deletions docs/source/create_dataset.mdx
@@ -7,6 +7,19 @@ In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for cre
* Folder-based builders for quickly creating an image or audio dataset
* `from_` methods for creating datasets from local files

## File-based builders

🤗 Datasets supports many common formats such as `csv`, `json/jsonl`, `parquet`, `txt`.

For example, it can read a dataset made up of one or several CSV files (to load several, pass your CSV files as a list, as sketched below):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```
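And the list form mentioned above, with placeholder file names:

```py
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
```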

To get the list of supported formats and code examples, follow this guide [here](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

## Folder-based builders

There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:
@@ -61,9 +74,9 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate

To learn more about each of these folder-based builders, check out the <a href="https://huggingface.co/docs/datasets/image_dataset#imagefolder"><span class="underline decoration-yellow-400 decoration-2 font-semibold">ImageFolder</span></a> or <a href="https://huggingface.co/docs/datasets/audio_dataset#audiofolder"><span class="underline decoration-pink-400 decoration-2 font-semibold">AudioFolder</span></a> guides.
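For a quick taste, a minimal sketch of loading a folder-based dataset (the directory path is a placeholder):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```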

## From local files
## From Python dictionaries

You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:

* The [`~Dataset.from_generator`] method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) due to a generator's iterative behavior. This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped (see the sketch below).
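A minimal sketch of the generator approach (the records are made up):

```py
>>> from datasets import Dataset
>>> def gen():
...     yield {"pokemon": "bulbasaur", "type": "grass"}
...     yield {"pokemon": "squirtle", "type": "water"}
>>> dataset = Dataset.from_generator(gen)
```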

@@ -103,10 +116,4 @@ You can also create a dataset from local files by specifying the path to the dat
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
```
## Next steps
We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.
To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
2 changes: 1 addition & 1 deletion docs/source/dataset_card.mdx
@@ -4,7 +4,7 @@ Each dataset should have a dataset card to promote responsible usage and inform
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in a just a few steps:
Creating a dataset card is easy and can be done in just a few steps:

1. Go to your dataset repository on the [Hub](https://hf.co/new-dataset) and click on **Create Dataset Card** to create a new `README.md` file in your repository.

2 changes: 1 addition & 1 deletion docs/source/faiss_es.mdx
@@ -1,6 +1,6 @@
# Search index

[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on a Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on an Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.

This guide will show you how to build an index for your dataset that will allow you to search it.
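As a preview, here is a hedged sketch of the FAISS flow (`ds_with_embeddings` and `question_embedding` are assumed to exist already):

```py
>>> import numpy as np
>>> ds_with_embeddings.add_faiss_index(column="embeddings")
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples(
...     "embeddings", np.asarray(question_embedding, dtype=np.float32), k=5
... )
```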

8 changes: 0 additions & 8 deletions docs/source/filesystems.mdx
@@ -142,14 +142,6 @@ Load a dataset builder from the Hugging Face Hub (see [how to load from the Hugg
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```
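For context, `storage_options` above typically holds the filesystem credentials; a hedged sketch using the s3fs convention, with placeholder values:

```py
>>> storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}
```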

Load a dataset builder using a loading script (see [how to load a local loading script](./loading#local-loading-script)):

```py
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

```py