NPTI

This repository is the official implementation of our paper "Neuron-based Personality Trait Induction in Large Language Models".

Dataset

To better identify personality-related neurons, we first constructed the PersonalityBench dataset, which comprises 180,000 open-ended questions tailored to capture distinct personality traits based on the Big Five personality theory. Specifically, we use the descriptions from the IPIP-NEO-300 questionnaire and the common real-world topics introduced in UltraChat to generate the situational questions in PersonalityBench. The dataset is available in NPTI/datset.

Personality-related Neurons Found by NPTI

Within a given layer, the FFN module can be expressed as:

$$\mathbf{h} = \left( \sigma \left( \hat{\mathbf{h}} \mathbf{W}_1 \right) \odot \left( \hat{\mathbf{h}} \mathbf{W}_3 \right) \right) \cdot \mathbf{W}_2,$$

where $\hat{\mathbf{h}} \in \mathbb{R}^{d}$ is the output of the MHA module for a specific token in this layer and $\sigma(\cdot)$ denotes a non-linear activation function, typically SiLU. The learned projection matrices are $\mathbf{W}_1 \in \mathbb{R}^{d \times d'}$, $\mathbf{W}_2 \in \mathbb{R}^{d' \times d}$, and $\mathbf{W}_3 \in \mathbb{R}^{d \times d'}$. In this context, a neuron corresponds to a single column of the weight matrix $\mathbf{W}_1$: the input $\hat{\mathbf{h}}$ is linearly projected onto that column, and the non-linear activation function is applied to the result.

We identified neurons related to the positive or negative aspect of each personality trait. The neurons are listed in the NPTI/neuron_results directory. Each neuron is represented as a tuple whose elements are, in order: (layer number, neuron index, absolute value of activation probability difference, activation probability, 95th percentile value).
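The FFN expression and the neuron definition above can be illustrated with a minimal NumPy sketch. It is not the repository's code: the function names `silu` and `ffn_forward`, the `NeuronRecord` type, the toy dimensions, and all numeric values are assumptions made only for illustration, and the on-disk format of the files in NPTI/neuron_results is not described here.

```python
from typing import NamedTuple

import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def ffn_forward(h_hat, W1, W2, W3):
    """Gated FFN: h = (silu(h_hat @ W1) * (h_hat @ W3)) @ W2.

    Returns the FFN output and the per-neuron activations silu(h_hat @ W1).
    """
    gate = silu(h_hat @ W1)  # shape (d',): one activation value per neuron
    up = h_hat @ W3          # shape (d',)
    h = (gate * up) @ W2     # shape (d,)
    return h, gate


class NeuronRecord(NamedTuple):
    """Hypothetical container mirroring the 5-element tuple described above."""
    layer: int        # layer number
    index: int        # neuron index within the layer
    prob_diff: float  # absolute value of activation probability difference
    act_prob: float   # activation probability
    p95: float        # 95th percentile value


# Toy dimensions and random weights (assumptions; real models are far larger).
rng = np.random.default_rng(0)
d, d_prime = 8, 32
h_hat = rng.normal(size=d)
W1 = rng.normal(size=(d, d_prime))
W2 = rng.normal(size=(d_prime, d))
W3 = rng.normal(size=(d, d_prime))

h, gate = ffn_forward(h_hat, W1, W2, W3)

# Neuron j is the j-th column of W1 followed by the activation function.
j = 5
neuron_j = silu(h_hat @ W1[:, j])
assert np.isclose(neuron_j, gate[j])

# A made-up record, only to show the field order of the tuple.
record = NeuronRecord(layer=12, index=j, prob_diff=0.4, act_prob=0.7, p95=1.3)
```

Returning `gate` alongside the output makes the per-neuron activations directly inspectable, which is the quantity the neuron definition refers to.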

Identifying Personality-related Neurons

To find personality-related neurons in LLaMA-8B-Instruct, you can execute:

bash NPTI/code/search_neuron.sh

Using NPTI to modify the personality of an LLM

To modify a given personality trait of an LLM using NPTI, you can execute:

bash NPTI/code/answer_question_change_neuron.sh

Using ChatGPT to score the results:

To have ChatGPT automatically rate how strongly each response expresses a specific personality trait, as well as the fluency of the response, you can execute:

python NPTI/code/gpt4_score.py
