[Refactor/Feat] load dataset (#254)

* [Major] refactor imports
* [CI] update tests
* [Feat] add hfd and hf-mirror
* [doc] more guide on loading datasets
* [fix] update hfd
* [fix] dataset formatting
* [fix] load with hfd
* [fix] resolve huggingface-cli import error
* [CI] test dataset formatting
* [CI] skip OOM
* [fix] fix failed tests
* support gpt-4o
* update customize dataset
* [doc] customize model
* [CI] annotate failures
* [Feat] load evaluation_data
* [Feat] hfd_cache_path
* [CI] split pytest
* [CI] fix splits
* [CI] skip cuda
* [doc] add CONTRIBUTING.md
* [fix] evaluation_data is not None
* [CI] download nltk
* [CI] fix temp folder
* [CI] fix cache path
* [CI] skip DatasetGenerationError
* [CI] re-run failures
* [ci] fix winograd
* [CI] fix pytest-results-action
* [CI] fix
* [ci] fix xlsum

RUCAIBox · Jun 6, 2024 · 82daf6e
1 parent e40fcf8

Showing 65 changed files with 2,189 additions and 1,021 deletions.
diff --git a/.github/workflows/isort-check.yml b/.github/workflows/isort-check.yml
@@ -7,7 +7,7 @@ on:
         - 'utilization/**'
 
 jobs:
-  build:
+  formatting-check:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3

diff --git a/.github/workflows/pytest-check.yml b/.github/workflows/pytest-check.yml
@@ -9,30 +9,78 @@ on:
         - '.github/workflows/**'
 
 jobs:
-  build:
-    name: Run tests
+  Pytest:
+    name: subtest
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.8.18"]
+        group: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 
     steps:
-      - uses: szenius/set-timezone@v1.2
+      - uses: szenius/set-timezone@v2.0
         with:
-          timezoneLinux: "Europe/Berlin"
+          timezoneLinux: "Asia/Shanghai"
       - uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
+      - name: Set up Python 3.8.18
         uses: actions/setup-python@v4
         with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install uv
-        run: pip install uv pip -U
+          python-version: 3.8.18
       - name: Install dependencies
-        run: uv pip install -r tests/requirements-tests.txt --system
-      - name: Install isolation dependencies
-        run: uv pip install vllm --no-build-isolation --system
-      - uses: pavelzw/pytest-action@v2
+        run: |
+          pip install uv pip -U
+          uv pip install -r tests/requirements-tests.txt --system
+          uv pip install vllm --no-build-isolation --system
+      - name: Run tests
+        run: pytest --cov --junit-xml=test-results.xml --splits 10 --group ${{ matrix.group }} --reruns 3 --only-rerun PermissionError
+        env:
+            GITHUB_ACTION: 1
+      - name: Surface failing tests
+        if: always()
+        uses: pmeier/pytest-results-action@multi-testsuites
         with:
-          emoji: false
-          verbose: true
-          job-summary: true
+          # A list of JUnit XML files, directories containing the former, and wildcard
+          # patterns to process.
+          # See @actions/glob for supported patterns.
+          path: test-results.xml
+
+          # (Optional) Add a summary of the results at the top of the report
+          summary: true
+
+          # (Optional) Select which results should be included in the report.
+          # Follows the same syntax as `pytest -r`
+          display-options: fEX
+
+          # (Optional) Fail the workflow if no JUnit XML was found.
+          fail-on-empty: true
+
+          # (Optional) Title of the test results section in the workflow summary
+          title: Test results
+      - name: Upload coverage
+        uses: actions/upload-artifact@v2
+        with:
+          name: coverage${{ matrix.group }}
+          path: .coverage
+
+  Coverage:
+    needs: Pytest
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.8.18
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.8.18
+      - name: Install dependencies
+        run: |
+          pip install uv pip -U
+          uv pip install -r tests/requirements-tests.txt --system
+          uv pip install vllm --no-build-isolation --system
+      - name: Download all artifacts
+        # Downloads coverage1, coverage2, etc.
+        uses: actions/download-artifact@v2
+      - name: Run coverage
+        run: |
+          coverage combine coverage*/.coverage*
+          coverage report --fail-under=90
+          coverage xml
+      - uses: codecov/codecov-action@v1
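
The workflow above shards the test suite with `--splits 10 --group ${{ matrix.group }}`, so each matrix job runs one slice of the tests. As a rough illustration of the idea only, the sketch below partitions a list of test IDs into contiguous, near-equal groups and selects one by its 1-indexed group number. The function names are hypothetical, and the splitting plugin used in CI may balance groups differently (for example by recorded durations), so treat this as a simplified model rather than the plugin's implementation.

```python
from typing import List


def split_into_groups(test_ids: List[str], splits: int) -> List[List[str]]:
    """Partition test IDs into `splits` contiguous groups of near-equal size."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    base, extra = divmod(len(test_ids), splits)
    groups: List[List[str]] = []
    start = 0
    for index in range(splits):
        # The first `extra` groups take one additional test each.
        size = base + (1 if index < extra else 0)
        groups.append(test_ids[start:start + size])
        start += size
    return groups


def select_group(test_ids: List[str], splits: int, group: int) -> List[str]:
    """Return the tests for a 1-indexed group, mirroring `--group` in the workflow."""
    if not 1 <= group <= splits:
        raise ValueError("group must be between 1 and splits")
    return split_into_groups(test_ids, splits)[group - 1]


if __name__ == "__main__":
    tests = [f"tests/test_{i}.py" for i in range(23)]
    for group in range(1, 11):
        print(group, len(select_group(tests, 10, group)))
```

Every test lands in exactly one group, which is why the `Coverage` job has to combine the per-group `.coverage` artifacts before reporting.
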
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,82 @@
+# Contributing
+
+Thanks for your interest in contributing to LLMBox! We welcome and appreciate contributions.
+To report bugs, create a [GitHub issue](https://github.com/RUCAIBox/LLMBox/issues).
+
+## Contribution Guide
+### 1. Fork the Official Repository
+
+Fork the [LLMBox repository](https://github.com/RUCAIBox/LLMBox) into your own account.
+Then clone your fork into your local environment.
+
+```shell
+git clone git@github.com:<YOUR-USERNAME>/LLMBox.git
+```
+
+### 2. Configure Git
+
+Set the official repository as your [upstream](https://www.atlassian.com/git/tutorials/git-forks-and-upstreams) so that you can synchronize with its latest updates.
+
+```shell
+cd LLMBox
+git remote add upstream git@github.com:RUCAIBox/LLMBox.git
+```
+
+Verify that the remote is set.
+```shell
+git remote -v
+```
+You should see both `origin` and `upstream` in the output.
+
+### 3. Synchronize with Official Repository
+Synchronize with the latest commit of the official repository before you start coding.
+
+```shell
+git fetch upstream
+git checkout main
+git merge upstream/main
+git push origin main
+```
+
+### 4. Create a New Branch And Open a Pull Request
+Create a new branch for your changes. After you finish the implementation, open your forked repository on GitHub and create a pull request. The source branch is your new branch, and the target branch is the `main` branch of `RUCAIBox/LLMBox`. The PR should then appear in [LLMBox PRs](https://github.com/RUCAIBox/LLMBox/pulls).
+
+The LLMBox team will then review your code.
+
+## PR Rules
+
+### 1. Pull Request title
+
+As described [here](https://github.com/commitizen/conventional-commit-types/blob/master/index.json), a valid PR title should begin with one of the following prefixes:
+
+- `feat`: A new feature
+- `fix`: A bug fix
+- `doc`: Documentation only changes
+- `refactor`: A code change that neither fixes a bug nor adds a feature
+- `style`: A refactoring that improves code style
+- `test`: Adding missing tests or correcting existing tests
+- `ci`: Changes to CI configuration files and scripts (example scopes: `.github`, `ci` (Buildkite))
+- `revert`: Reverts a previous commit
+
+For example, a PR title could be:
+- `refactor: modify package path`
+- `feat(training): xxxx`, where `(training)` means that this PR mainly focuses on the training component.
+
+You may also check out previous PRs in the [PR list](https://github.com/RUCAIBox/LLMBox/pulls).
+
+### 2. Pull Request description
+
+- If your PR is small (such as a typo fix), you can go brief.
+- If it is large and you have changed a lot, it's better to write more details.
+
+
+## How to begin
+Please refer to the README in each module:
+- [training](./training)
+- [utilization](./utilization)
+- [docs](./docs)
+
+## Tests
+Please navigate to the `tests` folder to see the existing test suites.
+At the moment, we have three kinds of tests: `pytest`, `isort`, and `yapf`.
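
The PR title rule in CONTRIBUTING.md could be checked mechanically. The sketch below is one possible validator, assuming the `prefix: summary` and `prefix(scope): summary` shapes shown in the examples. The names `ALLOWED_PREFIXES`, `TITLE_PATTERN`, and `is_valid_pr_title` are hypothetical and are not part of LLMBox or its CI.

```python
import re

# Prefixes listed in the "Pull Request title" section of CONTRIBUTING.md.
ALLOWED_PREFIXES = ("feat", "fix", "doc", "refactor", "style", "test", "ci", "revert")

# `<prefix>` or `<prefix>(<scope>)`, then a colon and a non-empty summary.
TITLE_PATTERN = re.compile(
    r"^(?P<prefix>" + "|".join(ALLOWED_PREFIXES) + r")"
    r"(\((?P<scope>[^()]+)\))?"
    r":\s+(?P<summary>\S.*)$"
)


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title starts with an allowed prefix and has a summary."""
    return TITLE_PATTERN.match(title) is not None


if __name__ == "__main__":
    for title in ("refactor: modify package path", "feat(training): xxxx", "update readme"):
        print(f"{title!r}: {is_valid_pr_title(title)}")
```

The match is case-sensitive by design in this sketch. A project that accepts titles such as `[Feat] ...` would need a different pattern.
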
diff --git a/README.md b/README.md
@@ -57,7 +57,7 @@ bash bash/run_7b_ds3.sh
 To utilize your model, or evaluate an existing model, you can run the following command:
 
 ```python
-python inference.py -m gpt-3.5-turbo -d copa  # --num_shot 0 --model_type instruction
+python inference.py -m gpt-3.5-turbo -d copa  # --num_shot 0 --model_type chat
 ```
 
 By default, this runs the OpenAI GPT-3.5 Turbo model on the CoPA dataset in a zero-shot manner.
@@ -118,12 +118,11 @@ We provide a broad support on Huggingface models (e.g. `LLaMA-3`, `Mistral`, or
 Currently a total of 56+ commonly used datasets are supported, including: `HellaSwag`, `MMLU`, `GSM8K`, `GPQA`, `AGIEval`, `CEval`, and `CMMLU`. For a full list of supported models and datasets, view the [utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization) documentation.
 
 ```bash
-python inference.py \
+CUDA_VISIBLE_DEVICES=0 python inference.py \
   -m llama-2-7b-hf \
   -d mmlu agieval:[English] \
-  --model_type instruction \
+  --model_type chat \
   --num_shot 5 \
-  --cuda 0 \
   --ranking_type ppl_no_option
 ```
 

diff --git a/docs/examples/customize_dataset.py b/docs/examples/customize_dataset.py
@@ -0,0 +1,47 @@
+import os
+import sys
+
+sys.path.append(".")
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
+from utilization import DatasetArguments, ModelArguments, get_evaluator, register_dataset
+from utilization.dataset import GenerationDataset
+
+
+@register_dataset(name="my_data")
+class MyData(GenerationDataset):
+
+    instruction = "Reply to my message: {input}\nReply:"
+    metrics = []
+
+    def format_instance(self, instance: dict) -> dict:
+        return instance
+
+    @property
+    def references(self):
+        return [i["target"] for i in self.evaluation_data]
+
+
+evaluator = get_evaluator(
+    model_args=ModelArguments(model_name_or_path="gpt-4o"),
+    dataset_args=DatasetArguments(
+        dataset_names=["my_data"],
+        num_shots=1,
+        max_example_tokens=2560,
+    ),
+    evaluation_data=[
+        {
+            "input": "Hello",
+            "target": "Hi"
+        },
+        {
+            "input": "How are you?",
+            "target": "I'm fine, thank you!"
+        },
+    ],
+    example_data=[{
+        "input": "What's the weather like today?",
+        "target": "It's sunny today."
+    }]
+)
+evaluator.evaluate()
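
The example above pairs an `instruction` template containing `{input}` with one entry of `example_data` and `num_shots=1`. The diff does not show how LLMBox renders or joins these pieces, so the sketch below is only an assumption about what a one-shot prompt could look like: each example is rendered with its target, then the instance under evaluation is rendered without one. `build_prompt` is a hypothetical helper, not an LLMBox function.

```python
from typing import Dict, List

# Same template string as `MyData.instruction` in the example file.
INSTRUCTION = "Reply to my message: {input}\nReply:"


def build_prompt(instance: Dict[str, str], examples: List[Dict[str, str]]) -> str:
    """Render each worked example with its target, then the unanswered instance."""
    parts = [INSTRUCTION.format(input=ex["input"]) + " " + ex["target"] for ex in examples]
    parts.append(INSTRUCTION.format(input=instance["input"]))
    # Assumed separator between demonstrations: a blank line.
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [{"input": "What's the weather like today?", "target": "It's sunny today."}]
    print(build_prompt({"input": "Hello", "target": "Hi"}, shots))
```
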
diff --git a/docs/examples/customize_huggingface_model.py b/docs/examples/customize_huggingface_model.py
@@ -1,12 +1,14 @@
+import sys
+
 import torch
 from transformers import LlamaForCausalLM
 
-from utilization import Evaluator
-from utilization.model.huggingface_model import get_model_max_length, load_tokenizer
-from utilization.utils import DatasetArguments, ModelArguments
+sys.path.append(".")
+from utilization import DatasetArguments, ModelArguments, get_evaluator
 
 
 def load_hf_model(model_args: ModelArguments):
+    from utilization.model.huggingface_model import get_model_max_length, load_tokenizer
 
     # load your own model
     model = LlamaForCausalLM.from_pretrained(
@@ -24,7 +26,7 @@ def load_hf_model(model_args: ModelArguments):
     return model, tokenizer
 
 
-evaluator = Evaluator(
+evaluator = get_evaluator(
     model_args=ModelArguments(
         model_name_or_path="../your-model-path",
         model_type="chat",

diff --git a/docs/utilization/customize-dataset.md → docs/utilization/how-to-customize-dataset.md b/docs/utilization/customize-dataset.md → docs/utilization/how-to-customize-dataset.md
@@ -2,6 +2,8 @@
 
 If you find some datasets are not supported in the current version, feel free to implement your own dataset and submit a PR.
 
+See the full list of supported datasets [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/supported-datasets.md).
+
 ## Choose the Right Dataset
 
 We provide two types of datasets: [`GenerationDataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset/generation_dataset.py) and [`MultipleChoiceDataset`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/dataset/multiple_choice_dataset.py).
@@ -35,7 +37,7 @@ These are the attributes you can define in a new dataset:
 
 - `example_set` (`Optional[str]`): The example split of dataset. Example data will be automatically loaded if this is not None.
 
-- `load_args` (`Union[Tuple[str], Tuple[str, str], Tuple[()]]`, **required\***): Arguments for loading the dataset with huggingface `load_dataset`. See [load from source data](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/customize-dataset.md#load-from-source-data) for details.
+- `load_args` (`Union[Tuple[str], Tuple[str, str], Tuple[()]]`, **required\***): Arguments for loading the dataset with huggingface `load_dataset`. See [load from source data](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md#load-from-source-data) for details.
 
 - `extra_model_args` (`Dict[str, Any]`): Extra arguments for the model like `temperature`, `stop` etc. See `set_generation_args`, `set_prob_args`, and `set_ppl_args` for details.
 
@@ -45,7 +47,7 @@ Then implement the following methods or properties:
 - `references` (**required**): Return the reference answers for evaluation.
 - `init_arguments`: Initialize the arguments for the dataset. This is called before the raw dataset is loaded.
 
-See [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/customize-dataset.md#advanced-topics) for advanced topics.
+See [here](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-customize-dataset.md#advanced-topics) for advanced topics.
 
 
 ## Load from Source Data

diff --git a/docs/utilization/how-to-customize-model.md b/docs/utilization/how-to-customize-model.md
@@ -0,0 +1,28 @@
+# How to Customize Model
+
+## Customizing HuggingFace Models
+
+If you are building on your own model, such as a fine-tuned model, you can evaluate it easily from a Python script. Detailed steps and example code are provided in the [customize HuggingFace model guide](https://github.com/RUCAIBox/LLMBox/tree/main/docs/examples/customize_huggingface_model.py).
+
+## Adding a New Model Provider
+
+If you're integrating a new model provider, begin by extending the [`Model`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/model/model.py) class. Implement essential methods such as `generation`, `get_ppl` (get perplexity), and `get_prob` (get probability) to support different functionalities. For instance, here's how you might implement the `generation` method for a new model:
+
+```python
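+from typing import Any, List
+
+# Import path assumed from the `utilization/model/model.py` link above.
+from utilization.model.model import Model
+
+
+class NewModel(Model):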
+
+    model_backend = "new_provider"
+
+    def call_model(self, batched_inputs: List[str]) -> List[Any]:
+        return ...  # call to model, e.g., self.model.generate(...)
+
+    def to_text(self, result: Any) -> str:
+        return ...  # convert result to text, e.g., result['text']
+
+    def generation(self, batched_inputs: List[str]) -> List[str]:
+        results = self.call_model(batched_inputs)
+        # `to_text` is an instance method, so it must be called through `self`.
+        results = [self.to_text(result) for result in results]
+        return results
+```
+
+Then register your model in the [`load`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/model/load.py) file.
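
To see the `call_model` / `to_text` / `generation` split from the new guide run end to end, here is a self-contained sketch. The `Model` class is a hypothetical stand-in for LLMBox's base class, included only so the snippet runs without the repository. `EchoModel` is a toy provider that upper-cases each prompt instead of calling a real API.

```python
from typing import Any, Dict, List


class Model:
    """Hypothetical stand-in for LLMBox's `Model` base class."""

    model_backend = "base"

    def generation(self, batched_inputs: List[str]) -> List[str]:
        raise NotImplementedError


class EchoModel(Model):
    """Toy provider: upper-cases each prompt instead of calling a real API."""

    model_backend = "echo_provider"

    def call_model(self, batched_inputs: List[str]) -> List[Dict[str, Any]]:
        # A real provider would send the batch to its API or engine here.
        return [{"text": prompt.upper()} for prompt in batched_inputs]

    def to_text(self, result: Dict[str, Any]) -> str:
        # Extract the generated text from the provider-specific result.
        return result["text"]

    def generation(self, batched_inputs: List[str]) -> List[str]:
        # Helpers are instance methods, so they are called through `self`.
        return [self.to_text(result) for result in self.call_model(batched_inputs)]


print(EchoModel().generation(["hello", "how are you?"]))
# prints ['HELLO', 'HOW ARE YOU?']
```

Keeping the raw call and the text extraction in separate methods means only `call_model` and `to_text` need to change when the provider's response format does.
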