Merge pull request #1 from Raldir/main

CaLF initial commit

zhiqiangdon authored Jul 15, 2024
2 parents d7a0052 + ab5df18 commit 3913f23
Showing 106 changed files with 335,627 additions and 6 deletions.
179 changes: 179 additions & 0 deletions .gitignore
@@ -0,0 +1,179 @@
# Custom
models/
data/*
!data/asqa
!data/prompts
!src/data
aws/
bin/installation/aws/
bin/installation/awscliv2.zip
index/
exp_out/
bin/s3/s3_credentials
bin/s3/aws/
bin/s3/awscliv2.zip
bin/aws/
bin/awscliv2.zip
lightning_logs/
offload/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
71 changes: 65 additions & 6 deletions README.md
@@ -1,11 +1,71 @@
## Description

This repository contains the code for **CaLF**, introduced in the ACL 2024 paper: [Learning to Generate Answers with Citations via Factual Consistency Models](https://arxiv.org/abs/2406.13124).

> Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of $34.1$, $15.5$, and $10.5$ citation F$_1$ points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.

## Installation

Install relevant packages within a new conda environment by calling the installation script from the root directory:
```
./bin/installation/install.sh
```

Download the AlignScore factual consistency model and place it into the expected folder by calling the following script:

```
./bin/installation/download_fcm.sh
```


### Download Data

Sample data from the ASQA dataset, allowing CaLF to run out of the box, is included in the repository under `data/asqa/`. CaLF can be trained on any long-form question answering dataset. To train a model on the full data for `ASQA`, `ELI5`, and `FactScore`, as used in the experiments described in the paper, please download the data [here](https://drive.google.com/file/d/1VulWcG80vQ6V7TZcq4kflitE5Xb7IHvE/view?usp=sharing).


## Running Experiments

To first check whether the code pipeline executes correctly, you can run CaLF on a small subset of the data by calling

```
./bin/run_calf_debug.sh default asqa lora_100_steps_bootstrapping_chat_templates token_rescaling mistralorca gtr all alignscore_threshold_09 0 0,1,2,3,4,5,6,7
```

To train and evaluate CaLF on the full ASQA dataset as provided in `data/asqa` run:

```
./bin/run_calf.sh default asqa lora_100_steps_bootstrapping_chat_templates token_rescaling mistralorca gtr all alignscore_threshold_09 0 0,1,2,3,4,5,6,7
```

1. `default`: answer truncation mode (none)
2. `asqa`: dataset
3. `lora_100_steps_bootstrapping_chat_templates`: environment (LoRA fine-tuning, 100 steps, chat templates, and in-context learning in the first iteration)
4. `token_rescaling`: focused learning mode (either `token_rescaling` or `none`)
5. `mistralorca`: LLM
6. `gtr`: retrieval system (retrieval is pre-compiled, with ASQA only supporting GTR and ELI5 only supporting BM25)
7. `all`: training samples to use
8. `alignscore_threshold_09`: generation of weakly-supervised data, filtered via AlignScore with a threshold of 0.9
9. `0`: random seed
10. `0,1,2,3,4,5,6,7`: CUDA visible devices

All configuration settings for each argument can be found in the folder `configs`.
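As a rough illustration (not part of the repository), the following sketch shows how `bin/run.sh` composes the `+`-joined `-c` config string from the positional arguments above; the variable names are invented here for readability:

```shell
# Sketch: how the eight config arguments expand into the '+'-joined -c string
# passed to src.run (variable names are illustrative, not from the repo).
truncation=default
dataset=asqa
environment=lora_100_steps_bootstrapping_chat_templates
focused=token_rescaling
model=mistralorca
retriever=gtr
samples=all
ws=alignscore_threshold_09

config="configs/answer_truncation/${truncation}.json"
config="${config}+configs/dataset/${dataset}.json"
config="${config}+configs/environment/${environment}.json"
config="${config}+configs/focused_learning/${focused}.json"
config="${config}+configs/model/${model}.json"
config="${config}+configs/retrieval/${retriever}.json"
config="${config}+configs/samples/${samples}.json"
config="${config}+configs/ws_attribution_training/${ws}.json"

echo "${config}"
```

Swapping any single argument in the launch scripts therefore only swaps the corresponding JSON file in `configs/`.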

The script trains the LLM iteratively on fully- and weakly-supervised data and evaluates its performance once training is complete.

If you want to run the evaluation script independently, call `val.sh` with the same arguments as above. If you have trained a CaLF model that you wish to evaluate, you can call `val_from_saved.sh`, which loads the trained weights before evaluation. Finally, `val_from_saved_transfer.sh` can be used to evaluate a model on a new dataset in a domain-transfer scenario (i.e. the results in Table 2). In addition to the aforementioned arguments, the transfer evaluation script takes two additional ones: the target dataset name and the target dataset's retrieval system (13 arguments in total).
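For concreteness, a hypothetical transfer-evaluation call might be assembled as follows; the concrete values (iteration `7`, transfer pair `eli5`/`bm25`) are assumptions for illustration only, not prescribed settings:

```shell
# Hypothetical invocation sketch: eleven training-style arguments plus the two
# transfer arguments (dataset name and its retrieval system), 13 in total.
set -- default asqa lora_100_steps_bootstrapping_chat_templates token_rescaling \
    mistralorca gtr all alignscore_threshold_09 0 0,1,2,3,4,5,6,7 7 \
    eli5 bm25
echo "number of arguments: $#"
# ./bin/val_from_saved_transfer.sh "$@"   # uncomment inside the repo checkout
```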

### Results
CaLF stops training on the sample data after 7 iterations, producing the following results:


| Model                  | Rouge-L  | EM Recall (Grounded) | Citation F1 |
|------------------------|----------|----------------------|-------------|
| Baseline (sample data) | 38.2 | 29.0 | 72.6 |
| CaLF (sample data) | 40.3 | 28.9 | 81.4 |
| CaLF | 40.9 | 34.5 | 80.5 |

With only the 240 unlabeled training instances of the sample data (a quarter of the full training data), we already observe substantial citation improvements over our fine-tuned baseline. For best results, however, train on the entire data collection (see above).


## Security

@@ -14,4 +74,3 @@ See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information
## License

This project is licensed under the Apache-2.0 License.

2 changes: 2 additions & 0 deletions bin/installation/download_fcm.sh
@@ -0,0 +1,2 @@
# Create the models directory if needed and fetch the AlignScore checkpoint.
mkdir -p models
wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt -P models/
11 changes: 11 additions & 0 deletions bin/installation/install.sh
@@ -0,0 +1,11 @@
# Create and activate the conda environment (the shell hook makes
# `conda activate` work inside a non-interactive script).
conda create -n lfqa python=3.10 -y
eval "$(conda shell.bash hook)"
conda activate lfqa

python3 -m pip install torch
python3 -m pip install -r requirements.txt
# Replace any CPU build of faiss with the GPU build (-y skips the prompt).
python3 -m pip uninstall -y faiss-cpu
python3 -m pip uninstall -y faiss-gpu
python3 -m pip install faiss-gpu

python3 -m spacy download en_core_web_sm
python3 -m nltk.downloader punkt
1 change: 1 addition & 0 deletions bin/run.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11} seed=${9} ws_iteration=${11}
31 changes: 31 additions & 0 deletions bin/run_calf.sh
@@ -0,0 +1,31 @@
# Iterative CaLF training loop. Iteration 1 generates weakly-supervised
# citation data in-context; subsequent iterations fine-tune on the filtered
# data and regenerate.
for i in 1 2 3 4 5 6 7 8 9 10
do
if [ "$i" = "1" ]; then
./bin/run.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

# Free GPU memory by killing any leftover python processes (-r: skip if none).
nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
else
./bin/train.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

./bin/run_from_saved.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i
status=$?

echo "$status"

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

# Stop iterating once generation exits non-zero (training has converged
# or failed).
if [ "$status" -ne 0 ]; then
break
fi
fi
done

# Final evaluation with the last saved weights.
./bin/val_from_saved.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
32 changes: 32 additions & 0 deletions bin/run_calf_debug.sh
@@ -0,0 +1,32 @@
# Debug variant of the CaLF loop: fewer iterations, run on a small data subset.
for i in 1 2 3
do
if [ "$i" = "1" ]; then
./bin/run_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

# Free GPU memory by killing any leftover python processes (-r: skip if none).
nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
else
./bin/train_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

# Capture the exit status immediately after the generation step.
./bin/run_from_saved_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i
status=$?

echo "$status"

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5

if [ "$status" -ne 0 ]; then
break
fi
fi
done

./bin/val_from_saved_debug.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 ${10} $i

nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -r -n1 kill -9
sleep 5
1 change: 1 addition & 0 deletions bin/run_debug.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11} seed=${9} is_debug=True ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/run_from_saved.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/run_from_saved_debug.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.run -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt is_debug=True ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/train.sh
@@ -0,0 +1,3 @@
export TOKENIZERS_PARALLELISM=true

CUDA_VISIBLE_DEVICES=${10} python3 -m src.train -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11} seed=${9} ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/train_debug.sh
@@ -0,0 +1,3 @@
export TOKENIZERS_PARALLELISM=true

CUDA_VISIBLE_DEVICES=${10} python3 -m src.train -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11} seed=${9} num_steps=50 is_debug=True ws_iteration=${11}
1 change: 1 addition & 0 deletions bin/val.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11} seed=${9} ws_iteration=${11}
1 change: 1 addition & 0 deletions bin/val_debug.sh
@@ -0,0 +1 @@
CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11} seed=${9} is_debug=True ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/val_from_saved.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt ws_iteration=${11}
3 changes: 3 additions & 0 deletions bin/val_from_saved_debug.sh
@@ -0,0 +1,3 @@
exp_name=$2/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8_DEBUG/${9}/iter_${11}

CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name}/finish.pt is_debug=True ws_iteration=${11}
5 changes: 5 additions & 0 deletions bin/val_from_saved_transfer.sh
@@ -0,0 +1,5 @@
exp_name=${2}/TRANSFER_answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_$6__samples_$7__ws_$8/${9}/iter_${11}
exp_name_weights=${12}/answer_truncation_$1__environment_$3__focused_learning_$4__model_$5__retriever_${13}__samples_$7__ws_$8/${9}/iter_${11}


CUDA_VISIBLE_DEVICES=${10} python3 -m src.val -c configs/answer_truncation/$1.json+configs/dataset/$2.json+configs/environment/$3.json+configs/focused_learning/$4.json+configs/model/$5.json+configs/retrieval/$6.json+configs/samples/$7.json+configs/ws_attribution_training/$8.json -k exp_name=${exp_name} seed=${9} load_weight=exp_out/${exp_name_weights}/finish.pt ws_iteration=${11}