Accompanying code for the paper "Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers" by the authors:
Zubaer, Abdullah Al; Granitzer, Michael; Geschwind, Stephan; Graf Lambsdorff, Johann; Voss, Deborah
Pre-print: https://doi.org/10.21203/rs.3.rs-4431780/v1
Affiliation (all authors): University of Passau, Passau, Germany.
This repository includes Python scripts to evaluate Inter-Rater Reliability (IRR) on the provided data and to save the results in dedicated results folders. Additionally, it provides a script to generate rank and point assessments using GPT-4 for all the prompts mentioned in our paper.
It consists of two Python scripts:
- create_rank_point_gpt4.py: Generates rank and point data using GPT-4 models.
- rank_point_assessment.py: Evaluates IRR based on generated data or original data.
Tested on Ubuntu 22.04.4 LTS.
To set up the project, you will need to have Anaconda installed on your system. If you don't have it installed, you can download it from here.
To use the OpenAI GPT-4 API, you need an API key. You can get it from here, and to set the API key as an environment variable, you can follow the instructions from here. The key is loaded in the helper_functions.py file like this:
openai.api_key = os.getenv("OPENAI_API_KEY")
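As an optional sanity check (not part of the repository), you can confirm that the environment variable is visible to Python before running the scripts:

# Optional check: verify that OPENAI_API_KEY is set in the current environment.
import os

api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; export it in your shell first.")
print(f"OPENAI_API_KEY found ({len(api_key)} characters).")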
- Create a conda environment:
conda create -n <env_name> python=3.10
conda activate <env_name>
- Clone the repository:
git clone https://github.com/abdullahalzubaer/Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers.git
cd Can-GPT-4-Replace-Human-Examiners-A-Competition-on-Checking-Open-Text-Answers
- Install dependencies: install the required Python packages using pip:
pip install -r requirements.txt
The two scripts are used sequentially or independently, depending on the data you have. Below are instructions for each scenario.
If you want to evaluate the original data directly, without generating any new ranks or points using GPT-4, the data must be present in the ./original_data/ directory. Then run one of the following commands.

python rank_point_assessment.py -at rank point -pq

This calculates the rank and point assessments for per-question IRR and saves the results in the ./original_data_results/... directories.
python rank_point_assessment.py -at rank point
This calculates the rank and point assessments and saves the pooled results in the ./original_data_results/... directories.
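The specific IRR measures are computed by rank_point_assessment.py and described in the paper. Purely as an illustration of what a rank-based agreement statistic looks like, here is a minimal, self-contained sketch using Kendall's tau (via scipy, assumed to be installed) on made-up ranks from two raters; it is not the script's actual method or data format:

# Generic illustration only: rank correlation between two raters on made-up data.
from scipy.stats import kendalltau

examiner_ranks = [1, 2, 3, 4, 5, 6]   # ranks assigned by a human examiner
gpt4_ranks     = [1, 3, 2, 4, 6, 5]   # ranks assigned by GPT-4 for the same answers

tau, p_value = kendalltau(examiner_ranks, gpt4_ranks)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")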
To generate new ranks and points using GPT-4, follow these steps:
python create_rank_point_gpt4.py
The generated data will be saved in the ./newly_generated_data/ directory.
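For orientation only, the sketch below shows what a call to GPT-4 via the legacy (pre-1.0) openai Python SDK might look like, matching the openai.api_key setup shown above. The actual prompts, model parameters, and output parsing are implemented in create_rank_point_gpt4.py and described in the paper; the message contents and settings here are placeholders:

# Illustrative sketch only; not the repository's actual prompts or parsing.
import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
    model="gpt-4",  # model name is a placeholder; the exact version may differ
    messages=[
        {"role": "system", "content": "You are an examiner grading open-text answers."},
        {"role": "user", "content": "Rank the following answers and assign points: ..."},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])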
Note: In rare instances, the GPT-4-generated data may not be perfectly parseable due to the non-deterministic nature of the model. You might encounter parsing issues, which can result in incomplete data. In such cases, it may be necessary to inspect the data manually: review the metadata_gpt4_ranks_points metadata file and correct the rank and point information in the data_gpt4_ranks_points file.
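If you want a quick programmatic check before evaluating, a sketch along the following lines can flag unparseable entries. The file path, CSV format, and column names below are assumptions and must be adapted to the actual layout of data_gpt4_ranks_points (pandas assumed to be installed):

# Hypothetical sanity check; adapt path and column names to the real file layout.
import pandas as pd

df = pd.read_csv("newly_generated_data/data_gpt4_ranks_points.csv")  # assumed path and format

for column in ("rank", "point"):  # assumed column names
    values = pd.to_numeric(df[column], errors="coerce")
    if values.isna().any():
        print(f"Column '{column}': {int(values.isna().sum())} unparseable entries; inspect the metadata file.")
    else:
        print(f"Column '{column}' parsed cleanly.")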
python rank_point_assessment.py -at rank point -pq -nd
This calculates the rank and point assessments for per-question IRR and saves the results in the ./newly_generated_data_results/... directories.
python rank_point_assessment.py -at rank point -nd
This calculates the rank and point assessments and saves the pooled results in the ./newly_generated_data_results/... directories.
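If you prefer to run the whole generate-then-evaluate workflow in one go, a small driver script like the following (not part of the repository) simply chains the commands documented above:

# Optional convenience sketch: chain generation and evaluation of new GPT-4 data.
import subprocess

subprocess.run(["python", "create_rank_point_gpt4.py"], check=True)
subprocess.run(["python", "rank_point_assessment.py", "-at", "rank", "point", "-pq", "-nd"], check=True)
subprocess.run(["python", "rank_point_assessment.py", "-at", "rank", "point", "-nd"], check=True)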
License
This code is licensed under the Apache-2.0 license. See the LICENSE file for details. The dataset license is mentioned here:
If you use our dataset or code, please cite the data source and our paper. Proper citation helps to ensure continued support for the project and acknowledges the work of the authors.
Dataset Citation:
@dataset{zubaer_2024_11085379,
author = {Zubaer, Abdullah Al and
Granitzer, Michael and
Geschwind, Stephan and
Graf Lambsdorff, Johann and
Voss, Deborah},
title = {{GPT-4 Shows Comparable Performance to Human
Examiners in Ranking Open-Text Answers}},
month = apr,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.11085379},
url = {https://doi.org/10.5281/zenodo.11085379}
}
Paper Citation [pre-print]:
Abdullah Al Zubaer, Michael Granitzer, Stephan Geschwind et al.
Can GPT-4 Replace Human Examiners? A Competition on Checking Open-Text Answers,
10 July 2024, PREPRINT (Version 1) available at Research Square
[https://doi.org/10.21203/rs.3.rs-4431780/v1]