diff --git a/README.md b/README.md
index f94840c..7eebfd7 100644
--- a/README.md
+++ b/README.md
@@ -11,19 +11,22 @@ Documentation + + Documentation + +

 ![](assets/overview_page1.png)
 **LooGLE** is a comprehensive evaluation benchmark for LLM long context understanding which contains up-to-date (all after 2022) and extremely long realistic documents (over 24k tokens per document, many of which exceed 100k words) and 6,000 newly generated questions spanning diverse domains and categories. Details statistics of our dataset can be seen in the table below.
-**Short and long dependency tasks 📜** LooGLE is composed of 7 major tasks to evaluate LLMs' ability to understand both short and long dependency content. We refer to ``long dependency" tasks as those that require the understanding of the inter-dependency across multiple shreds of evidences widely spanning over the entire long text. We delicately design 5 types of long dependency tasks, including comprehension and reasoning, computation, timeline reorder, multiple information retrieval, and summarization.
+**Short and long dependency tasks 📜** LooGLE is composed of 7 major tasks to evaluate LLMs' ability to understand both short and long dependency content. We refer to ``long dependency" tasks as those that require the understanding of the inter-dependency across multiple shreds of evidence widely spanning over the entire long text. We delicately design 5 types of long dependency tasks, including comprehension and reasoning, computation, timeline reorder, multiple information retrieval, and summarization.
-**Long context evaluation 📊** In order to provide more comprehensive and general results, LooGLE relies on automatic automatic metrics based on semantic similarity, GPT4-as-judgment and human evaluation to get an overall performance for reference. We conduct the evaluation of 8 representative LLMs. We specifically select LLMs which have made great effort in addressing the challenge of understanding long contexts by utilizing flash attention, position interpolation, optimized Transformer and finetuning, external memory etc.
+**Long context evaluation 📊** In order to provide more comprehensive and general results, LooGLE relies on automatic metrics based on semantic similarity, GPT4-as-judgment and human evaluation to get an overall performance for reference. We conduct the evaluation of 8 representative LLMs. We specifically select LLMs which have made great effort in addressing the challenge of understanding long contexts by utilizing flash attention, position interpolation, optimized Transformer and finetuning, external memory etc.
-LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on future development of enhanced models towards “true long-context understanding”.
+LooGLE not only provides a systematic and comprehensive evaluation schema on long-context LLMs, but also sheds light on the future development of enhanced models toward “true long-context understanding”.
-
 ## 📌 **Statistics of LooGLE**
@@ -40,14 +43,9 @@ LooGLE not only provides a systematic and comprehensive evaluation schema on lon
 - [**Prediction for retrieval based methods**](#prediction-for-retrieval-based-methods)
 - [📊 **Evaluation**](#-evaluation)
 - [**Evaluation on Timeline reorder task**](#evaluation-on-timeline-reorder-task)
-- [💡 **Main result on short and long dependency tasks**](#-main-result-on-short-and-long-dependency-tasks)
- - [**Performance of the short dependency tasks**](#performance-of-the-short-dependency-tasks)
- - [**Performance of the long dependency tasks**](#performance-of-the-long-dependency-tasks)
- - [**Impact of input length on long dependency tasks**](#impact-of-input-length-on-long-dependency-tasks)
 - [📝 **Citation**](#-citation)
 - [📣 **Contacts**](#-contacts)
-
 ## 🚀 **Capability leaderboard**
 The overall performance comparisons of different models on different tasks in our dataset are shown in the figure below.
@@ -104,7 +102,7 @@ To mention that, in long dependency QA data, we add an extra key `type` for each
 ### **Step 3. Generate the prediction results**
-We test LLMs using 3 python codes under the path [Prediction/](Prediction/) for corresponding types of models. We select the model for evaluation via `--model_name` and the specific task via `--task`. Let's take short dependency QA as an example:
+We test LLMs using 3 Python codes under the path [Prediction/](Prediction/) for corresponding types of models. We select the model for evaluation via `--model_name` and the specific task via `--task`. Let's take short dependency QA as an example:
 For GPT-3.5-turbo and GPT4:
 ```
@@ -121,33 +119,29 @@ For other open-source models (take chatglm2-6b-32k as an example):
 python Prediction/pred_opensource_models.py --model_name chatglm2-6b-32k --task shortdep_qa --max_length 500
 ```
-Open-source models can be download and loaded from [Models/](Models/) by default, you can change the path via `--model_path`
+Open-source models can be downloaded and loaded from [Models/](Models/) by default, you can change the path via `--model_path`
-You can also determine the long texts output result through `--output_path`.
+You can also determine the long text output result through `--output_path`.
-Please note that in `config/`, we provide the prompt format suitable for each task and the maximum generation length. The input parameter `--max_length` limits the max length of input prompt for selcted model. Feel free to modify them to better suit the model you want to evaluate.
+Please note that in `config/`, we provide the prompt format suitable for each task and the maximum generation length. The input parameter `--max_length` limits the max length of the input prompt for selected model. Feel free to modify them to better suit the model you want to evaluate.
 We test all the open-source baselines with a single 80G A800 GPU in BF16 precision. For Llama-2 based models, we recommend using [Flash Attention](https://github.com/Dao-AILab/flash-attention) for optimization and saving GPU memory.
-
-
-### **Prediction for retrieval based methods**
+### **Prediction for retrieval-based methods**
-To evaluate the effectiveness of retrieval techniques for long-context dependency questions, we undertook an extensive experiments by replacing the base LLM model in LlamaIndex with different baseline LLMs.
+To evaluate the effectiveness of retrieval techniques for long-context dependency questions, we undertook extensive experiments by replacing the base LLM model in LlamaIndex with different baseline LLMs.
-For retrieval based methods (take chatglm2-6b-32k as an example):
+For retrieval-based methods (take chatglm2-6b-32k as an example):
 ```
 python Retrieval/pred_retrieval_based_method.py --model_name chatglm2-6b-32k --task shortdep_qa --max_length 500 --emb_model_name sentence-transformers/all-mpnet-base-v2
 ```
-Use `--emb_model_name` to set embedding models for retrieval based methods. Here we used all-mpnet-base-v2 as default.
-
-
+Use `--emb_model_name` to set embedding models for retrieval-based methods. Here we used all-mpnet-base-v2 as default.
 ## 📊 **Evaluation**
 Given the prediction file generated in Step 2, we run the evaluation code in [Evaluation/](Evaluation/).
-For automatic evaluation in short and long dependency QA, summarization task (eg. short dependency QA):
+For automatic evaluation in short and long-dependency QA, summarization task (eg. short-dependency QA):
 ```
 python Evaluation/automatic_eval.py --model_name chatglm2-6b-32k --task shortdep_qa --eval_metric automatic_sim
@@ -169,530 +163,15 @@ Besides the parameters specifying the `--model_name` and `--task`, we provide `-
 Automatic metrics based on semantic similarity matching including Bleu, Rouge, Meteor, Bertscore and exact/partial match are supported. Feel free to add other metrics for your needs in [Evaluation/automatic_metrics.py](Evaluation/automatic_metrics.py). Besides, the prompt of GPT4 given in the repo can be altered for further evaluation.
-
-
 ### **Evaluation on Timeline reorder task**
 We provide four metrics: LSD (location square deviation), LMD (location mean deviation), SD
-(swap deviation), and SDD (swap distance deviation) to measure the similarity of numeric sequences for time reorder task with regularized outputs. Details of the implementations can be seen in our paper.
+(swap deviation), and SDD (swap distance deviation) to measure the similarity of numeric sequences for time reorder tasks with regularized outputs. Details of the implementations can be seen in our paper.
 For LLM in long dependency timeline reorder task:
 ```
 python Reorder/automatic_eval.py --model_name chatglm2-6b-32k
 ```
-
-
-## 💡 **Main result on short and long dependency tasks**
-
-### **Performance of the short dependency tasks**
-
-| Models | Context | Short dependency QA | | | | | | | | Cloze | |
-|---|---|---|---|---|---|---|---|---|---|---|---|
-| | | Bleu1 | Bleu4 | Rouge1 | Rouge4 | RougeL | Meteor score | Bert score | GPT4 score | Exact Match | Partial Match |
-| GPT4-32k | 32k | 24.61 | 11.14 | 61.80 | 50.73 | 60.75 | 32.94 | 78.72 | 71.52 | 70.50 | 80.81 |
-| GPT4-8k | 8K | 27.35 | 14.38 | 67.59 | 56.01 | 65.77 | 38.56 | 87.93 | 53.99 | 66.03 | 76.62 |
-| GPT3.5-turbo-16k | 16K | 22.67 | 9.62 | 62.56 | 48.63 | 60.66 | 32.58 | 87.04 | 66.82 | 54.64 | 63.42 |
-| LlamaIndex | - | 33.37 | 21.43 | 58.82 | 42.93 | 57.08 | 37.17 | 86.58 | 59.61 | 58.95 | 66.86 |
-| ChatGLM2-6B | 32k | 14.29 | 6.07 | 20.50 | 13.16 | 20.36 | 13.08 | 87.28 | 23.65 | 0.05 | 0.98 |
-| LongLLaMa-3B | 256k | 1.37 | 0.26 | 26.97 | 11.02 | 26.10 | 11.34 | 71.65 | 13.75 | - | 2.13 |
-| RWKV-4-14B-pile | 8k | 0.80 | 0.04 | 21.70 | 6.39 | 20.64 | 9.41 | 70.42 | 8.93 | - | - |
-| LLaMA2-7B-32K | 32k | 0.18 | 7.25*e-308 | 1.86 | 0.00 | 1.86 | 1.52 | 61.53 | 3.18 | - | 0.58 |
-
-
-### **Performance of the long dependency tasks**
-
-| Models | Context | Bleu1 | Bleu4 | Rouge1 | Rouge4 | RougeL | Meteor score | Bert score | GPT4 score |
-|---|---|---|---|---|---|---|---|---|---|
-| **arXiv paper summarization** | | | | | | | | | |
-| GPT4-32k | 32k | 24.50 | 0.73 | 27.15 | 7.10 | 24.25 | 19.03 | 84.04 | 82.84 |
-| GPT4-8k | 8k | 29.02 | 2.09 | 32.08 | 11.11 | 28.85 | 22.64 | 84.92 | 85.42 |
-| GPT3.5-turbo-16k | 16k | 28.70 | 1.59 | 32.04 | 10.69 | 28.89 | 22.34 | 84.82 | 86.84 |
-| LlamaIndex | - | 22.53 | 0.63 | 26.28 | 6.97 | 23.73 | 21.07 | 83.09 | 76.35 |
-| ChatGLM2-6B | 32k | 0.04 | 1.60e-310 | 5.97 | 8.43E-05 | 5.82 | 6.40 | 73.25 | 13.23 |
-| LongLLaMa-3B | 256k | 4.24 | 9.32e-309 | 4.10 | 0.52 | 3.86 | 3.82 | 73.41 | 12.28 |
-| RWKV-4-14B-pile | 8k | 6.28 | 4.58E-05 | 6.45 | 0.74 | 6.01 | 6.00 | 75.28 | 7.02 |
-| LLaMA2-7B-32K | 32k | 0.03 | 4.66e-310 | 0.12 | 0.00 | 0.12 | 0.67 | 71.21 | 7.60 |
-| **Long dependency QA** | | | | | | | | | |
-| GPT4-32k | 32k | 8.55 | 1.40 | 25.59 | 6.36 | 24.04 | 11.13 | 80.16 | 54.09 |
-| GPT4-8k | 8k | 8.94 | 1.01 | 23.45 | 6.57 | 21.69 | 10.18 | 85.36 | 42.12 |
-| GPT3.5-turbo-16k | 16k | 6.92 | 1.81 | 25.02 | 6.68 | 23.63 | 10.40 | 83.79 | 45.04 |
-| LlamaIndex | - | 7.76 | 1.24 | 23.62 | 7.10 | 22.30 | 10.47 | 83.87 | 37.63 |
-| ChatGLM2-6B | 32k | 5.55 | 0.11 | 9.41 | 1.93 | 8.69 | 4.39 | 85.78 | 11.50 |
-| LongLLaMa-3B | 256k | 1.04 | 3.12E-307 | 2.96 | 0.03 | 2.71 | 1.66 | 78.60 | 6.48 |
-| RWKV-4-14B-pile | 8k | 0.71 | 9.52E-307 | 18.54 | 1.55 | 17.69 | 3.45 | 71.36 | 5.33 |
-| LLaMA2-7B-32K | 32k | 0.08 | 2.44E-308 | 2.05 | 0.00 | 2.05 | 0.46 | 50.28 | 4.18 |
-
-
-### **Impact of input length on long dependency tasks**
-
-| Models | Context | Bleu1 | Bleu4 | Rouge1 | Rouge4 | RougeL | Meteor score | Bert score | GPT4 score |
-|---|---|---|---|---|---|---|---|---|---|
-| **arXiv paper summarization** | | | | | | | | | |
-| GPT4-32k | 32k | 24.50 | 0.73 | 27.15 | 7.10 | 24.25 | 19.03 | 84.04 | 82.84 |
-| GPT4-32k | 24k | 25.57 | 0.81 | 27.61 | 7.53 | 24.73 | 19.86 | 84.07 | 83.15 |
-| GPT4-32k | 16k | 24.8 | 0.70 | 27.29 | 7.26 | 24.28 | 19.12 | 84.11 | 82.82 |
-| GPT4-32k | 8k | 26.26 | 9.35 | 27.83 | 7.67 | 24.74 | 20.08 | 84.10 | 82.75 |
-| GPT4-8k | 8k | 29.02 | 2.09 | 32.08 | 11.11 | 28.85 | 22.64 | 84.92 | 85.42 |
-| **Long dependency QA** | | | | | | | | | |
-| GPT4-32k | 32k | 7.64 | 1.24 | 15.53 | 4.46 | 14.60 | 11.12 | 86.07 | 54.65 |
-| GPT4-32k | 24k | 8.23 | 1.66 | 14.92 | 4.12 | 13.90 | 10.60 | 86.16 | 50.61 |
-| GPT4-32k | 16k | 8.57 | 1.35 | 16.21 | 4.30 | 14.90 | 11.91 | 86.36 | 47.55 |
-| GPT4-32k | 8k | 7.46 | 1.77 | 13.75 | 5.08 | 12.89 | 10.01 | 85.77 | 38.34 |
-| GPT4-8k | 8k | 8.94 | 1.01 | 23.45 | 6.57 | 21.69 | 10.18 | 85.36 | 42.12 |
-