Merge branch 'main' into rm-tasks
albertvillanova committed Aug 21, 2024
2 parents 304eeb7 + 90b1d94 commit 879cfce
Showing 184 changed files with 1,873 additions and 17,048 deletions.
33 changes: 30 additions & 3 deletions .github/workflows/ci.yml
@@ -28,8 +28,8 @@ jobs:
pip install .[quality]
- name: Check quality
run: |
ruff check tests src benchmarks metrics utils setup.py # linter
ruff format --check tests src benchmarks metrics utils setup.py # formatter
ruff check tests src benchmarks utils setup.py # linter
ruff format --check tests src benchmarks utils setup.py # formatter
test:
needs: check_code_quality
@@ -56,7 +56,7 @@ jobs:
- name: Install uv
run: pip install --upgrade uv
- name: Install dependencies
run: uv pip install --system "datasets[tests,metrics-tests] @ ."
run: uv pip install --system "datasets[tests] @ ."
- name: Install dependencies (latest versions)
if: ${{ matrix.os == 'ubuntu-latest' }}
run: uv pip install --system -r additional-tests-requirements.txt --no-deps
@@ -96,3 +96,30 @@ jobs:
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
test_py310_numpy2:
needs: check_code_quality
strategy:
matrix:
test: ['unit']
os: [ubuntu-latest, windows-latest]
deps_versions: [deps-latest]
continue-on-error: false
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install uv
run: pip install --upgrade uv
- name: Install dependencies
run: uv pip install --system "datasets[tests_numpy2] @ ."
- name: Test with pytest
run: |
python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
7 changes: 0 additions & 7 deletions .gitignore
@@ -42,13 +42,6 @@ venv.bak/
.idea
.vscode

# keep only the empty datasets and metrics directory with it's __init__.py file
/src/*/datasets/*
!/src/*/datasets/__init__.py

/src/*/metrics/*
!/src/*/metrics/__init__.py

# Vim
.*.swp

2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
.PHONY: quality style test

check_dirs := tests src benchmarks metrics utils
check_dirs := tests src benchmarks utils

# Check that source code meets quality standards

4 changes: 0 additions & 4 deletions additional-tests-requirements.txt
@@ -1,5 +1 @@
unbabel-comet>=1.0.0
git+https://github.com/pytorch/data.git
git+https://github.com/google-research/bleurt.git
git+https://github.com/ns-moosavi/coval.git
git+https://github.com/hendrycks/math.git
1 change: 0 additions & 1 deletion docs/source/_redirects.yml
@@ -8,7 +8,6 @@ splits: loading#slice-splits
processing: process
faiss_and_ea: faiss_es
features: about_dataset_features
using_metrics: how_to_metrics
exploring: access
package_reference/logging_methods: package_reference/utilities
# end of first_section
8 changes: 0 additions & 8 deletions docs/source/_toctree.yml
@@ -15,8 +15,6 @@
title: Know your dataset
- local: use_dataset
title: Preprocess
- local: metrics
title: Evaluate predictions
- local: create_dataset
title: Create a dataset
- local: upload_dataset
@@ -48,10 +46,6 @@
title: Search index
- local: cli
title: CLI
- local: how_to_metrics
title: Metrics
- local: beam
title: Beam Datasets
- local: troubleshoot
title: Troubleshooting
title: "General usage"
@@ -113,8 +107,6 @@
title: Build and load
- local: about_map_batch
title: Batch mapping
- local: about_metrics
title: All about metrics
title: "Conceptual guides"
- sections:
- local: package_reference/main_classes
10 changes: 3 additions & 7 deletions docs/source/about_mapstyle_vs_iterable.mdx
@@ -166,7 +166,7 @@ It provides even faster data loading when iterating using a `for` loop by iterat
However as soon as your [`Dataset`] has an indices mapping (via [`Dataset.shuffle`] for example), the speed can become 10x slower.
This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore.
To restore the speed, you'd need to rewrite the entire dataset on your disk again using [`Dataset.flatten_indices`], which removes the indices mapping.
This may take a lot of time depending of the size of your dataset though:
This may take a lot of time depending on the size of your dataset though:

```python
my_dataset[0] # fast
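# A minimal illustrative sketch of the pattern described above (not part of the original snippet):
my_dataset = my_dataset.shuffle(seed=42)   # adds an indices mapping
my_dataset[0]                              # slower: indirect, non-contiguous reads
my_dataset = my_dataset.flatten_indices()  # rewrites the dataset contiguously on disk
my_dataset[0]                              # fast again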
@@ -205,19 +205,15 @@ for epoch in range(n_epochs):
pass
```

## Checkpoint and resuming differences

If your training loop stops, you may want to restart training from where it left off. To do so, you can save a checkpoint of your model and optimizers, as well as your data loader.

To restart the iteration of a map-style dataset, you can simply skip the first examples:

```python
my_dataset = my_dataset.select(range(start_index, len(my_dataset)))
```

But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have write a custom sampler that allows resuming).
But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have written a custom sampler that allows resuming).
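As a minimal sketch (assuming PyTorch's `Sampler` API; the class below is illustrative and not part of 🤗 Datasets), a resumable sampler only needs to remember how far it got:

```python
from torch.utils.data import Sampler

class ResumableSequentialSampler(Sampler):
    """Illustrative sequential sampler that can save and restore its position."""

    def __init__(self, data_source, start_index=0):
        self.data_source = data_source
        self.start_index = start_index  # index to resume from

    def __iter__(self):
        for idx in range(self.start_index, len(self.data_source)):
            self._last_index = idx
            yield idx
        self.start_index = 0  # reset once a full epoch completes

    def __len__(self):
        return len(self.data_source) - self.start_index

    def state_dict(self):
        # Save this alongside your model and optimizer checkpoint.
        return {"start_index": getattr(self, "_last_index", self.start_index - 1) + 1}

    def load_state_dict(self, state_dict):
        self.start_index = state_dict["start_index"]
```

You would then checkpoint `sampler.state_dict()` together with your model and optimizer, and call `sampler.load_state_dict(...)` before rebuilding the `DataLoader`.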

On the other hand, iterable datasets don't provide random access to a specific example inde to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:

```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
25 changes: 0 additions & 25 deletions docs/source/about_metrics.mdx

This file was deleted.

4 changes: 1 addition & 3 deletions docs/source/audio_dataset.mdx
@@ -14,8 +14,6 @@ There are several methods for creating and sharing an audio dataset:

* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

* Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale audio datasets.


<Tip>

@@ -175,7 +173,7 @@ Some audio datasets, like those found in [Kaggle competitions](https://www.kaggl

</Tip>

## Loading script
## (Legacy) Loading script

Write a dataset loading script to manually create a dataset.
It defines a dataset's splits and configurations, and handles downloading and generating the dataset examples.
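As a rough sketch of the shape such a script takes (the class name, URL, and features below are placeholders, not taken from this repository):

```python
import os

import datasets


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    """Illustrative skeleton of a legacy loading script."""

    def _info(self):
        # Declare the features of the dataset.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "label": datasets.ClassLabel(names=["cat", "dog"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # Download and extract the raw data, then declare the splits.
        archive = dl_manager.download_and_extract("https://example.com/my_audio_data.tar.gz")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": archive},
            ),
        ]

    def _generate_examples(self, data_dir):
        # Yield (key, example) pairs, one per audio file.
        for idx, fname in enumerate(sorted(os.listdir(data_dir))):
            yield idx, {"audio": os.path.join(data_dir, fname), "label": "cat"}
```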
52 changes: 0 additions & 52 deletions docs/source/beam.mdx

This file was deleted.

20 changes: 0 additions & 20 deletions docs/source/cache.mdx
@@ -24,13 +24,6 @@ When you load a dataset, you also have the option to change where the data is ca
>>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
```

Similarly, you can change where a metric is cached with the `cache_dir` parameter:

```py
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
```

## Download mode

After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
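A minimal sketch of forcing a fresh download (the dataset name is a placeholder):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', download_mode='force_redownload')
```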
@@ -77,19 +70,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par

</Tip>

You can also avoid caching your metric entirely, and keep it in CPU memory instead:

```py
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', keep_in_memory=True)
```

<Tip warning={true}>

Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.

</Tip>

<a id='load_dataset_enhancing_performance'></a>

## Improve performance
3 changes: 1 addition & 2 deletions docs/source/cli.mdx
@@ -8,12 +8,11 @@ You can check the available commands:
usage: datasets-cli <command> [<args>]

positional arguments:
{convert,env,test,run_beam,dummy_data,convert_to_parquet}
{convert,env,test,dummy_data,convert_to_parquet}
datasets-cli command helpers
convert Convert a TensorFlow Datasets dataset to a HuggingFace Datasets dataset.
env Print relevant system environment info.
test Test dataset implementation.
run_beam Run a Beam dataset processing pipeline
dummy_data Generate dummy data.
convert_to_parquet Convert dataset to Parquet
delete_from_hub Delete dataset config from the Hub
23 changes: 15 additions & 8 deletions docs/source/create_dataset.mdx
@@ -7,6 +7,19 @@ In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for cre
* Folder-based builders for quickly creating an image or audio dataset
* `from_` methods for creating datasets from local files

## File-based builders

🤗 Datasets supports many common formats such as `csv`, `json/jsonl`, `parquet`, `txt`.

For example, it can read a dataset made up of one or several CSV files (to load several, pass your CSV files as a list, as sketched below):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```
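And the list form mentioned above, with placeholder file names:

```py
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
```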

To get the list of supported formats and code examples, follow this guide [here](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

## Folder-based builders

There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:
@@ -61,9 +74,9 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate

To learn more about each of these folder-based builders, check out the <a href="https://huggingface.co/docs/datasets/image_dataset#imagefolder"><span class="underline decoration-yellow-400 decoration-2 font-semibold">ImageFolder</span></a> or <a href="https://huggingface.co/docs/datasets/audio_dataset#audiofolder"><span class="underline decoration-pink-400 decoration-2 font-semibold">AudioFolder</span></a> guides.
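For a quick taste, a minimal sketch of loading a folder-based dataset (the directory path is a placeholder):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```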

## From local files
## From Python dictionaries

You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:

* The [`~Dataset.from_generator`] method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) due to a generator's iterative behavior. This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped (see the sketch below).
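A minimal sketch of the generator approach (the records are made up):

```py
>>> from datasets import Dataset
>>> def gen():
...     yield {"pokemon": "bulbasaur", "type": "grass"}
...     yield {"pokemon": "squirtle", "type": "water"}
>>> dataset = Dataset.from_generator(gen)
```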

@@ -103,10 +116,4 @@ You can also create a dataset from local files by specifying the path to the dat
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
```
## Next steps
We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.
To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
2 changes: 1 addition & 1 deletion docs/source/dataset_card.mdx
@@ -4,7 +4,7 @@ Each dataset should have a dataset card to promote responsible usage and inform
This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in a just a few steps:
Creating a dataset card is easy and can be done in just a few steps:

1. Go to your dataset repository on the [Hub](https://hf.co/new-dataset) and click on **Create Dataset Card** to create a new `README.md` file in your repository.

2 changes: 1 addition & 1 deletion docs/source/faiss_es.mdx
@@ -1,6 +1,6 @@
# Search index

[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on a Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on an Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.

This guide will show you how to build an index for your dataset that will allow you to search it.
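As a preview, here is a hedged sketch of the FAISS flow (`ds_with_embeddings` and `question_embedding` are assumed to exist already):

```py
>>> import numpy as np
>>> ds_with_embeddings.add_faiss_index(column="embeddings")
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples(
...     "embeddings", np.asarray(question_embedding, dtype=np.float32), k=5
... )
```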

8 changes: 0 additions & 8 deletions docs/source/filesystems.mdx
@@ -142,14 +142,6 @@ Load a dataset builder from the Hugging Face Hub (see [how to load from the Hugg
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```
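For context, `storage_options` above typically holds the filesystem credentials; a hedged sketch using the s3fs convention, with placeholder values:

```py
>>> storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}
```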

Load a dataset builder using a loading script (see [how to load a local loading script](./loading#local-loading-script)):

```py
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

```py