
Model Drift

This repository provides the code for plotting persona drift in LLM-based chatbots, as discussed in Measuring and Controlling Persona Drift in Language Model Dialogs.

Abstract

Prompting is a standard tool for customizing language-model chatbots, enabling them to take on a specific "persona". An implicit assumption in the use of prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated persona for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating persona stability via self-chats between two personalized chatbots. Testing popular models like LLaMA2-chat-70B, we reveal significant persona drift within eight rounds of conversation. An empirical and theoretical analysis of this phenomenon suggests that the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and persona drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
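One way to picture the split-softmax idea is as a reweighting of attention mass toward the system-prompt tokens at inference time. The sketch below is a minimal illustration based only on the abstract, not the authors' implementation: the boost factor `alpha`, the hard split at the first `n_sys` positions, and the renormalization scheme are all our assumptions.

```python
import numpy as np

def split_softmax(scores, n_sys, alpha=1.5):
    """Illustrative sketch (not the paper's exact method): take ordinary
    softmax attention weights, scale up the mass on the first `n_sys`
    (system-prompt) positions by `alpha`, scale down the rest, and
    renormalize so the weights still sum to 1."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    sys_mass = probs[:n_sys].sum()
    rest_mass = probs[n_sys:].sum()
    boosted = min(alpha * sys_mass, 1.0)  # capped so it stays a distribution
    out = probs.copy()
    if sys_mass > 0:
        out[:n_sys] *= boosted / sys_mass
    if rest_mass > 0:
        out[n_sys:] *= (1.0 - boosted) / rest_mass
    return out
```

The intuition being counteracted: as the dialog grows, the fixed-size system prompt receives a shrinking share of attention, so the persona it stipulates fades.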

Installation

To install:

conda env create -f environment.yml
conda activate drift
python -m ipykernel install --user --name drift --display-name "drift"

Generating Self-Chats

For example,

python run.py --model_name llama2_chat_70B --agent -1 --user -1 --turns 8 --seed 1 --runs 2

generates an episode of self-chat between two copies of llama2_chat_70B. The personas of the two chatbots are sampled at random (controlled by --seed 1) from the 100 personas we define here. The conversation runs for 8 turns (--turns), i.e., 4 rounds. At each agent turn (2, 4, ..., 8), the probe question is asked 2 times (--runs). Results are saved to the selfchat folder.
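The episode structure described above can be sketched as a simple loop. This is a schematic of the protocol only, not the repo's actual code: the utterance and probe strings stand in for real model calls.

```python
def self_chat(turns=8, runs=2):
    """Schematic of one self-chat episode: user and agent alternate
    turns; at each agent turn (the even ones), the probe question is
    asked `runs` times. Strings are placeholders for model outputs."""
    history, probes = [], {}
    for t in range(1, turns + 1):
        speaker = "agent" if t % 2 == 0 else "user"
        history.append((t, speaker, f"utterance@{t}"))
        if speaker == "agent":
            # probe the agent's persona `runs` times at this turn
            probes[t] = [f"probe_answer@{t}#{r}" for r in range(runs)]
    return history, probes
```

With the defaults above, an episode yields 8 utterances and probe answers at turns 2, 4, 6, and 8, matching the command-line example.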

Note that the model can be loaded from HuggingFace or accessed via API, e.g. --model_name gpt-3.5-turbo-16k. The code is easily hackable, so you can swap in your own locally built model.

You can also skip this process by downloading self-chat histories from this google drive and putting them into the selfchat folder.

Plotting Persona Drift

Please check out plot_convergence.ipynb.

How to Cite

@article{li2024measuring,
  title={Measuring and Controlling Persona Drift in Language Model Dialogs},
  author={Li, Kenneth and Liu, Tianle and Bashkansky, Naomi and Bau, David and Vi{\'e}gas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin},
  journal={arXiv preprint arXiv:2402.10962},
  year={2024}
}
