Merge pull request #1 from Raldir/main

CaLF initial commit

zhiqiangdon authored Jul 15, 2024
2 parents d7a0052 + ab5df18 commit 3913f23
Showing 106 changed files with 335,627 additions and 6 deletions.
179 changes: 179 additions & 0 deletions .gitignore
@@ -0,0 +1,179 @@
# Custom
models/
data/*
!data/asqa
!data/prompts
!src/data
aws/
bin/installation/aws/
bin/installation/awscliv2.zip
index/
exp_out/
bin/s3/s3_credentials
bin/s3/aws/
bin/s3/awscliv2.zip
bin/aws/
bin/awscliv2.zip
lightning_logs/
offload/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
71 changes: 65 additions & 6 deletions README.md
@@ -1,11 +1,71 @@
## Description

This repository contains the code for **CaLF**, introduced in the ACL 2024 paper: [Learning to Generate Answers with Citations via Factual Consistency Models](https://arxiv.org/abs/2406.13124).

> Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of $34.1$, $15.5$, and $10.5$ citation F$_1$ points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.

## Installation

Install relevant packages within a new conda environment by calling the installation script from the root directory:
```
./bin/installation/install.sh
```

Download the AlignScore factual consistency model and place it into the expected folder by calling the following script:

```
./bin/installation/download_fcm.sh
```


### Download Data

Sample data from the ASQA dataset, allowing CaLF to run out of the box, is included in the repository under `data/asqa/`. CaLF can be trained on any long-form question answering dataset. To train a model on the full data for `ASQA`, `ELI5`, and `FactScore`, as used in the experiments described in the paper, please download the data [here](https://drive.google.com/file/d/1VulWcG80vQ6V7TZcq4kflitE5Xb7IHvE/view?usp=sharing).


## Running Experiments

To first check whether the code pipeline executes correctly, you can run CaLF on a small subset of the data by calling

```
./bin/run_calf_debug.sh default asqa lora_100_steps_bootstrapping_chat_templates token_rescaling mistralorca gtr all alignscore_threshold_09 0 0,1,2,3,4,5,6,7
```

To train and evaluate CaLF on the full ASQA dataset as provided in `data/asqa` run:

```
./bin/run_calf.sh default asqa lora_100_steps_bootstrapping_chat_templates token_rescaling mistralorca gtr all alignscore_threshold_09 0 0,1,2,3,4,5,6,7
```

1. `default`: answer truncation mode (none)
2. `asqa`: dataset
3. `lora_100_steps_bootstrapping_chat_templates`: environment (LoRA fine-tuning, 100 steps, chat templates, and in-context learning in the first iteration)
4. `token_rescaling`: focused learning mode (either `token_rescaling` or `none`)
5. `mistralorca`: LLM
6. `gtr`: retrieval system (retrieval is pre-compiled, with ASQA only supporting GTR and ELI5 only supporting BM25)
7. `all`: training samples to use
8. `alignscore_threshold_09`: generation of weakly-supervised data, filtered via AlignScore with a threshold of 0.9
9. `0`: random seed
10. `0,1,2,3,4,5,6,7`: CUDA visible devices

All configuration settings for each argument can be found in the folder `configs`.
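As a rough illustration (not part of the repository), the following sketch shows how `bin/run.sh` composes the `+`-joined `-c` config string from the positional arguments above; the variable names are invented here for readability:

```shell
# Sketch: how the eight config arguments expand into the '+'-joined -c string
# passed to src.run (variable names are illustrative, not from the repo).
truncation=default
dataset=asqa
environment=lora_100_steps_bootstrapping_chat_templates
focused=token_rescaling
model=mistralorca
retriever=gtr
samples=all
ws=alignscore_threshold_09

config="configs/answer_truncation/${truncation}.json"
config="${config}+configs/dataset/${dataset}.json"
config="${config}+configs/environment/${environment}.json"
config="${config}+configs/focused_learning/${focused}.json"
config="${config}+configs/model/${model}.json"
config="${config}+configs/retrieval/${retriever}.json"
config="${config}+configs/samples/${samples}.json"
config="${config}+configs/ws_attribution_training/${ws}.json"

echo "${config}"
```

Swapping any single argument in the launch scripts therefore only swaps the corresponding JSON file in `configs/`.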

The script trains the LLM iteratively on fully- and weakly-supervised data and evaluates its performance once training is complete.

If you want to run the evaluation script independently, call `val.sh` with the same arguments as above. If you have trained a CaLF model that you wish to evaluate, you can call `val_from_saved.sh`, which loads the trained weights before evaluation. Finally, `val_from_saved_transfer.sh` can be used to evaluate a model on a new dataset in a domain-transfer scenario (i.e. the results in Table 2). In addition to the aforementioned arguments, the transfer evaluation script takes two additional ones: the target dataset name and the target dataset's retrieval system (13 arguments in total).
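For concreteness, a hypothetical transfer-evaluation call might be assembled as follows; the concrete values (iteration `7`, transfer pair `eli5`/`bm25`) are assumptions for illustration only, not prescribed settings:

```shell
# Hypothetical invocation sketch: eleven training-style arguments plus the two
# transfer arguments (dataset name and its retrieval system), 13 in total.
set -- default asqa lora_100_steps_bootstrapping_chat_templates token_rescaling \
    mistralorca gtr all alignscore_threshold_09 0 0,1,2,3,4,5,6,7 7 \
    eli5 bm25
echo "number of arguments: $#"
# ./bin/val_from_saved_transfer.sh "$@"   # uncomment inside the repo checkout
```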

### Results
CaLF stops training on the sample data after 7 iterations, producing the following results:


| Model                  | Rouge-L  | EM Recall (Grounded) | Citation F1 |
|------------------------|----------|----------------------|-------------|
| Baseline (sample data) | 38.2 | 29.0 | 72.6 |
| CaLF (sample data) | 40.3 | 28.9 | 81.4 |
| CaLF | 40.9 | 34.5 | 80.5 |

With only the 240 unlabeled training instances of the sample data (a quarter of the full training data), we already observe substantial citation improvements over our fine-tuned baseline. For best results, however, train on the entire data collection (see above).


## Security

@@ -14,4 +74,3 @@ See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information
## License

This project is licensed under the Apache-2.0 License.

2 changes: 2 additions & 0 deletions bin/installation/download_fcm.sh
@@ -0,0 +1,2 @@
# Create the models directory if needed and fetch the AlignScore checkpoint.
mkdir -p models
wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt -P models/
11 changes: 11 additions & 0 deletions bin/installation/install.sh
@@ -0,0 +1,11 @@
# Create and activate the conda environment (the shell hook makes
# `conda activate` work inside a non-interactive script).
conda create -n lfqa python=3.10 -y
eval "$(conda shell.bash hook)"
conda activate lfqa

python3 -m pip install torch
python3 -m pip install -r requirements.txt
# Replace any CPU build of faiss with the GPU build (-y skips the prompt).
python3 -m pip uninstall -y faiss-cpu
python3 -m pip uninstall -y faiss-gpu
python3 -m pip install faiss-gpu

python3 -m spacy download en_core_web_sm
python3 -m nltk.downloader punkt
1 change: 1 addition & 0 deletions bin/run.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11} seed=${9} ws_iteration=${11}
31 changes: 31 additions & 0 deletions bin/run_calf.sh
@@ -0,0 +1,31 @@
# Iterative CaLF training loop. Iteration 1 generates weakly-supervised
# citation data in-context; subsequent iterations fine-tune on the filtered
# data and regenerate.
for i in 1 2 3 4 5 6 7 8 9 10
do
if [ "$i" = "1" ]; then
./bin/run.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

# Free GPU memory by killing any leftover python processes (-r: skip if none).
nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
else
./bin/train.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

./bin/run_from_saved.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i
status=$?

echo "$status"

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

# Stop iterating once generation exits non-zero (training has converged
# or failed).
if [ "$status" -ne 0 ]; then
break
fi
fi
done

# Final evaluation with the last saved weights.
./bin/val_from_saved.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
32 changes: 32 additions & 0 deletions bin/run_calf_debug.sh
@@ -0,0 +1,32 @@
# Debug variant of the CaLF loop: fewer iterations, run on a small data subset.
for i in 1 2 3
do
if [ "$i" = "1" ]; then
./bin/run_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

# Free GPU memory by killing any leftover python processes (-r: skip if none).
nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
else
./bin/train_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

# Capture the exit status immediately after the generation step.
./bin/run_from_saved_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i
status=$?

echo "$status"

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

if [ "$status" -ne 0 ]; then
break
fi
fi
done

./bin/val_from_saved_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
1 change: 1 addition & 0 deletions bin/run_debug.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11} seed=${9} is_debug=True ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/run_from_saved.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/run_from_saved_debug.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt is_debug=True ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/train.sh
@@ -0,0 +1,3 @@
export TOKENIZERS_PARALLELISM=true

CUDA_VISIBLE_DEVICES=${10} python3 -m src.train -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11} seed=${9} ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/train_debug.sh
@@ -0,0 +1,3 @@
export TOKENIZERS_PARALLELISM=true

CUDA_VISIBLE_DEVICES=${10} python3 -m src.train -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11} seed=${9} num_steps=50 is_debug=True ws_iteration=${11}
1 change: 1 addition & 0 deletions bin/val.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11} seed=${9} ws_iteration=${11}
1 change: 1 addition & 0 deletions bin/val_debug.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11} seed=${9} is_debug=True ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/val_from_saved.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/val_from_saved_debug.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt is_debug=True ws_iteration=${11}
5 changes: 5 additions & 0 deletions bin/val_from_saved_transfer.sh
@@ -0,0 +1,5 @@
exp_name=${2}/TRANSFER_answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11}
exp_name_weights=${12}/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_${13}__samples_$7__ws_$8/${9}/iter_${11}


CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name_weights}/finish.pt ws_iteration=${11}