Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* [Major] refactor imports
* [CI] update tests
* [Feat] add hfd and hf-mirror
* [doc] more guide on loading datasets
* [fix] update hfd
* [fix] dataset formatting
* [fix] load with hfd
* [fix] resolve huggingface-cli import error
* [CI] test dataset formatting
* [CI] skip OOM
* [fix] fix failed tests
* support gpt-4o
* update customize dataset
* [doc] customize model
* [CI] annotate failures
* [Feat] load evaluation_data
* [Feat] hfd_cache_path
* [CI] split pytest
* [CI] fix splits
* [CI] skip cuda
* [doc] add CONTRIBUTING.md
* [fix] evaluation_data is not None
* [CI] download nltk
* [CI] fix temp folder
* [CI] fix cache path
* [CI] skip DatasetGenerationError
* [CI] re-run failures
* [ci] fix winograd
* [CI] fix pytest-results-action
* [CI] fix
* [ci] fix xlsum
Showing 65 changed files with 2,189 additions and 1,021 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
# Contributing | ||
|
||
Thanks for your interest in contributing to LLMBox! We welcome and appreciate contributions. | ||
To report bugs, create a [GitHub issue](https://github.com/RUCAIBox/LLMBox/issues). | ||
|
||
## Contribution Guide | ||
### 1. Fork the Official Repository | ||
|
||
Fork the [LLMBox repository](https://github.com/RUCAIBox/LLMBox) to your own account. | ||
Then clone your fork to your local environment. | ||
|
||
```shell | ||
git clone git@github.com:<YOUR-USERNAME>/LLMBox.git | ||
``` | ||
|
||
### 2. Configure Git | ||
|
||
Set the official repository as your [upstream](https://www.atlassian.com/git/tutorials/git-forks-and-upstreams) remote so that you can synchronize your fork with its latest updates. | ||
Add the official repository as `upstream`: | ||
|
||
```shell | ||
cd LLMBox | ||
git remote add upstream git@github.com:RUCAIBox/LLMBox.git | ||
``` | ||
|
||
Verify that the remote is set. | ||
```shell | ||
git remote -v | ||
``` | ||
You should see both `origin` and `upstream` in the output. | ||
|
||
### 3. Synchronize with Official Repository | ||
Before you start coding, synchronize your fork with the latest commits from the official repository. | ||
|
||
```shell | ||
git fetch upstream | ||
git checkout main | ||
git merge upstream/main | ||
git push origin main | ||
``` | ||
|
||
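### 4. Create a New Branch and Open a Pull Request | ||
Create a new branch for your changes (for example, `git checkout -b <your-branch>`), commit your work, and push the branch to your fork. After you finish the implementation, open your forked repository on GitHub and create a pull request. The source branch is your new branch, and the target branch is the `main` branch of `RUCAIBox/LLMBox`. The PR should then appear in [LLMBox PRs](https://github.com/RUCAIBox/LLMBox/pulls). | ||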
|
||
The LLMBox team will then review your code. | ||
|
||
## PR Rules | ||
|
||
### 1. Pull Request title | ||
|
||
As described in the [conventional commit types](https://github.com/commitizen/conventional-commit-types/blob/master/index.json), a valid PR title should begin with one of the following prefixes: | ||
|
||
- `feat`: A new feature | ||
- `fix`: A bug fix | ||
- `doc`: Documentation only changes | ||
- `refactor`: A code change that neither fixes a bug nor adds a feature | ||
- `style`: A refactoring that improves code style | ||
- `test`: Adding missing tests or correcting existing tests | ||
- `ci`: Changes to CI configuration files and scripts (example scopes: `.github`, `ci` (Buildkite)) | ||
- `revert`: Reverts a previous commit | ||
|
||
For example, a PR title could be: | ||
- `refactor: modify package path` | ||
- `feat(training): xxxx`, where `(training)` means that this PR mainly focuses on the training component. | ||
|
||
You may also check out previous PRs in the [PR list](https://github.com/RUCAIBox/LLMBox/pulls). | ||
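The prefix rule above can be checked mechanically. The following is a minimal sketch and not part of LLMBox: the names `ALLOWED_PREFIXES`, `TITLE_PATTERN`, and `is_valid_pr_title` are hypothetical, and the exact `prefix(scope): summary` shape is an assumption inferred from the two example titles.

```python
import re

# Prefixes listed in the PR rules above.
ALLOWED_PREFIXES = ("feat", "fix", "doc", "refactor", "style", "test", "ci", "revert")

# Assumed shape: `<prefix>: <summary>` or `<prefix>(<scope>): <summary>`.
TITLE_PATTERN = re.compile(
    r"^(?P<prefix>" + "|".join(ALLOWED_PREFIXES) + r")"
    r"(?:\((?P<scope>[^()]+)\))?"  # optional scope, e.g. (training)
    r": (?P<summary>\S.*)$"  # colon, space, non-empty summary
)


def is_valid_pr_title(title: str) -> bool:
    """Return True if `title` starts with an allowed prefix and an optional scope."""
    return TITLE_PATTERN.match(title) is not None
```

With this sketch, `is_valid_pr_title("feat(training): xxxx")` is accepted, while a title without a listed prefix, such as `"update readme"`, is rejected.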
|
||
### 2. Pull Request description | ||
|
||
- If your PR is small (such as a typo fix), a brief description is enough. | ||
- If it is large and touches many parts of the code, describe the changes in more detail. | ||
|
||
|
||
## How to begin | ||
Please refer to the README in each module: | ||
- [training](./training) | ||
- [utilization](./utilization) | ||
- [docs](./docs) | ||
|
||
## Tests | ||
Please see the `tests` folder for the existing test suites. | ||
At the moment, we have three kinds of tests: `pytest`, `isort`, and `yapf`. |
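The three checks named above could be run locally with a small helper before opening a PR. This is only a sketch under stated assumptions: `pytest`, `isort`, and `yapf` are installed in the current environment, the paths passed in are your choice, and the project's CI may use different options. The function names `build_check_commands` and `run_checks` are hypothetical.

```python
import subprocess
import sys
from typing import List


def build_check_commands(paths: List[str]) -> List[List[str]]:
    """Build one command per check; none of them modifies files."""
    return [
        # Run the test suites in the `tests` folder.
        [sys.executable, "-m", "pytest", "tests"],
        # Report import-order problems without rewriting files.
        [sys.executable, "-m", "isort", "--check-only", "--diff", *paths],
        # Print the formatting diff instead of reformatting in place.
        [sys.executable, "-m", "yapf", "--diff", "--recursive", *paths],
    ]


def run_checks(paths: List[str]) -> bool:
    """Run every check and return True only if all of them exit with code 0."""
    results = [subprocess.run(cmd).returncode == 0 for cmd in build_check_commands(paths)]
    return all(results)
```

For example, `run_checks(["utilization", "training", "tests"])` would run all three checks from the repository root and report whether every one of them passed.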
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
import os | ||
import sys | ||
|
||
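# Make the repository root importable when the script is run from the LLMBox directory. | ||
sys.path.append(".") | ||
# Expose only the first GPU to this process. | ||
os.environ["CUDA_VISIBLE_DEVICES"] = "0" | ||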
|
||
from utilization import DatasetArguments, ModelArguments, get_evaluator, register_dataset | ||
from utilization.dataset import GenerationDataset | ||
|
||
|
||
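# Register the dataset under the name used in `dataset_names` below. | ||
@register_dataset(name="my_data") | ||
class MyData(GenerationDataset): | ||
|
||
    # Prompt template; `{input}` refers to the "input" field of each instance. | ||
    instruction = "Reply to my message: {input}\nReply:" | ||
    # No metrics are configured in this minimal example. | ||
    metrics = [] | ||
|
||
    def format_instance(self, instance: dict) -> dict: | ||
        # Instances already carry "input" and "target", so pass them through unchanged. | ||
        return instance | ||
|
||
    @property | ||
    def references(self): | ||
        # Reference answers come from the "target" field of the evaluation data. | ||
        return [i["target"] for i in self.evaluation_data] | ||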
|
||
|
||
evaluator = get_evaluator( | ||
    model_args=ModelArguments(model_name_or_path="gpt-4o"), | ||
    dataset_args=DatasetArguments( | ||
        dataset_names=["my_data"], | ||
        num_shots=1, | ||
        max_example_tokens=2560, | ||
    ), | ||
    evaluation_data=[ | ||
        { | ||
            "input": "Hello", | ||
            "target": "Hi" | ||
        }, | ||
        { | ||
            "input": "How are you?", | ||
            "target": "I'm fine, thank you!" | ||
        }, | ||
    ], | ||
    example_data=[{ | ||
        "input": "What's the weather like today?", | ||
        "target": "It's sunny today." | ||
    }] | ||
) | ||
evaluator.evaluate() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# How to Customize Model | ||
|
||
## Customizing HuggingFace Models | ||
|
||
If you are building on your own model, such as a fine-tuned model, you can evaluate it easily from a Python script. Detailed steps and example code are provided in the [customize HuggingFace model guide](https://github.com/RUCAIBox/LLMBox/tree/main/docs/examples/customize_huggingface_model.py). | ||
|
||
## Adding a New Model Provider | ||
|
||
If you're integrating a new model provider, begin by extending the [`Model`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/model/model.py) class. Implement essential methods such as `generation`, `get_ppl` (get perplexity), and `get_prob` (get probability) to support different functionalities. For instance, here's how you might implement the `generation` method for a new model: | ||
|
||
```python | ||
from typing import Any, List | ||
|
||
||
from utilization.model.model import Model  # path assumed from the linked `model.py` | ||
|
||
||
class NewModel(Model): | ||
|
||
||
    model_backend = "new_provider" | ||
|
||
||
    def call_model(self, batched_inputs: List[str]) -> List[Any]: | ||
        return ...  # call to model, e.g., self.model.generate(...) | ||
|
||
||
    def to_text(self, result: Any) -> str: | ||
        return ...  # convert result to text, e.g., result['text'] | ||
|
||
||
    def generation(self, batched_inputs: List[str]) -> List[str]: | ||
        results = self.call_model(batched_inputs) | ||
        # `to_text` is an instance method, so it must be called through `self`. | ||
        results = [self.to_text(result) for result in results] | ||
        return results | ||
``` | ||
|
||
Then register your model in the [`load`](https://github.com/RUCAIBox/LLMBox/tree/main/utilization/model/load.py) file. |
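To see the `call_model`, `to_text`, and `generation` flow run end to end without LLMBox installed, here is a self-contained toy version. It is a sketch only: the `Model` class below is a stand-in written for this example, not LLMBox's real base class, and `EchoModel` and its `"echo"` backend name are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Model(ABC):
    """Stand-in for LLMBox's `Model` base class (hypothetical, for illustration only)."""

    model_backend: str = ""

    @abstractmethod
    def generation(self, batched_inputs: List[str]) -> List[str]:
        ...


class EchoModel(Model):
    """Toy provider that "generates" by upper-casing each prompt."""

    model_backend = "echo"  # hypothetical backend name

    def call_model(self, batched_inputs: List[str]) -> List[Dict[str, Any]]:
        # A real provider would call its API or `generate` method here.
        return [{"text": prompt.upper()} for prompt in batched_inputs]

    def to_text(self, result: Dict[str, Any]) -> str:
        # Extract the generated string from the provider-specific response.
        return result["text"]

    def generation(self, batched_inputs: List[str]) -> List[str]:
        results = self.call_model(batched_inputs)
        return [self.to_text(result) for result in results]


outputs = EchoModel().generation(["hello", "how are you?"])
print(outputs)  # ['HELLO', 'HOW ARE YOU?']
```

Keeping the provider call and the response parsing in separate methods means `generation` stays the same when the backend's response format changes; only `to_text` needs to be adapted.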