Accompanying code for the paper "Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers" by the authors:
Zubaer, Abdullah Al; Granitzer, Michael; Geschwind, Stephan; Graf Lambsdorff, Johann; Voss, Deborah
Pre-print: https://doi.org/10.21203/rs.3.rs-4431780/v1
Affiliation (all authors): University of Passau, Passau, Germany.
This repository includes Python scripts to evaluate Inter-Rater Reliability (IRR) on the provided data and to save the results in dedicated results folders. Additionally, it provides a script to generate rank and point assessments using GPT-4 for all the prompts mentioned in our paper.
It consists of two Python scripts:
- create_rank_point_gpt4.py: Generates rank and point data using GPT-4 models.
- rank_point_assessment.py: Evaluates IRR based on generated data or original data.
Tested on Ubuntu 22.04.4 LTS.
To set up the project, you will need to have Anaconda installed on your system. If you don't have it installed, you can download it from here.
To use the OpenAI GPT-4 API, you need an API key. You can get it from here, and to set the API key as an environment variable, you can follow the instructions from here. The key is loaded in the helper_functions.py file like this:
openai.api_key = os.getenv("OPENAI_API_KEY")
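As an optional sanity check (not part of the repository), you can confirm that the environment variable is visible to Python before running the scripts:

# Optional check: verify that OPENAI_API_KEY is set in the current environment.
import os

api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; export it in your shell first.")
print(f"OPENAI_API_KEY found ({len(api_key)} characters).")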
- Create a conda environment:
conda create -n <env_name> python=3.10
conda activate <env_name>
- Clone the repository:
git clone https://github.com/abdullahalzubaer/Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers.git
cd Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers
- Install dependencies: install the required Python packages using pip:
pip install -r requirements.txt
The two scripts are used sequentially or independently, depending on the data you have. Below are instructions for each scenario.
If you want to evaluate the original data directly, without generating any new ranks or points using GPT-4, the data must be present in the ./original_data/ directory. Then run one of the following commands.

python rank_point_assessment.py -at rank point -pq

This calculates the rank and point assessments for per-question IRR and saves the results in the ./original_data_results/... directories.
python rank_point_assessment.py -at rank point
This calculates the rank and point assessments and saves the pooled results in the ./original_data_results/... directories.
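The specific IRR measures are computed by rank_point_assessment.py and described in the paper. Purely as an illustration of what a rank-based agreement statistic looks like, here is a minimal, self-contained sketch using Kendall's tau (via scipy, assumed to be installed) on made-up ranks from two raters; it is not the script's actual method or data format:

# Generic illustration only: rank correlation between two raters on made-up data.
from scipy.stats import kendalltau

examiner_ranks = [1, 2, 3, 4, 5, 6]   # ranks assigned by a human examiner
gpt4_ranks     = [1, 3, 2, 4, 6, 5]   # ranks assigned by GPT-4 for the same answers

tau, p_value = kendalltau(examiner_ranks, gpt4_ranks)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")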
To generate new ranks and points using GPT-4, follow these steps:
python create_rank_point_gpt4.py
The generated data will be saved in the ./newly_generated_data/ directory.
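For orientation only, the sketch below shows what a call to GPT-4 via the legacy (pre-1.0) openai Python SDK might look like, matching the openai.api_key setup shown above. The actual prompts, model parameters, and output parsing are implemented in create_rank_point_gpt4.py and described in the paper; the message contents and settings here are placeholders:

# Illustrative sketch only; not the repository's actual prompts or parsing.
import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
    model="gpt-4",  # model name is a placeholder; the exact version may differ
    messages=[
        {"role": "system", "content": "You are an examiner grading open-text answers."},
        {"role": "user", "content": "Rank the following answers and assign points: ..."},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])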
Note: In rare instances, the GPT-4-generated data may not be perfectly parseable due to the non-deterministic nature of the model. You might encounter parsing issues, which can result in incomplete data. In such cases, it may be necessary to inspect the data manually: review the metadata_gpt4_ranks_points metadata file and correct the rank and point information in the data_gpt4_ranks_points file.
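If you want a quick programmatic check before evaluating, a sketch along the following lines can flag unparseable entries. The file path, CSV format, and column names below are assumptions and must be adapted to the actual layout of data_gpt4_ranks_points (pandas assumed to be installed):

# Hypothetical sanity check; adapt path and column names to the real file layout.
import pandas as pd

df = pd.read_csv("newly_generated_data/data_gpt4_ranks_points.csv")  # assumed path and format

for column in ("rank", "point"):  # assumed column names
    values = pd.to_numeric(df[column], errors="coerce")
    if values.isna().any():
        print(f"Column '{column}': {int(values.isna().sum())} unparseable entries; inspect the metadata file.")
    else:
        print(f"Column '{column}' parsed cleanly.")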
python rank_point_assessment.py -at rank point -pq -nd
This calculates the rank and point assessments for per-question IRR and saves the results in the ./newly_generated_data_results/... directories.
python rank_point_assessment.py -at rank point -nd
This calculates the rank and point assessments and saves the pooled results in the ./newly_generated_data_results/... directories.
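If you prefer to run the whole generate-then-evaluate workflow in one go, a small driver script like the following (not part of the repository) simply chains the commands documented above:

# Optional convenience sketch: chain generation and evaluation of new GPT-4 data.
import subprocess

subprocess.run(["python", "create_rank_point_gpt4.py"], check=True)
subprocess.run(["python", "rank_point_assessment.py", "-at", "rank", "point", "-pq", "-nd"], check=True)
subprocess.run(["python", "rank_point_assessment.py", "-at", "rank", "point", "-nd"], check=True)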
License
This code is licensed under the Apache-2.0 license. See the LICENSE file for details. The dataset license is mentioned here:
If you use our dataset or code, please cite the data source and our paper. Proper citation helps to ensure continued support for the project and acknowledges the work of the authors.
Dataset Citation:
@dataset{zubaer_2024_11085379,
author = {Zubaer, Abdullah Al and
Granitzer, Michael and
Geschwind, Stephan and
Graf Lambsdorff, Johann and
Voss, Deborah},
title = {{GPT-4 Shows Comparable Performance to Human
Examiners in Ranking Open-Text Answers}},
month = apr,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.11085379},
url = {https://doi.org/10.5281/zenodo.11085379}
}
Paper Citation [pre-print]:
Abdullah Al Zubaer, Michael Granitzer, Stephan Geschwind et al.
Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers,
10 July 2024, PREPRINT (Version 1) available at Research Square
[https://doi.org/10.21203/rs.3.rs-4431780/v1]