Easily evaluate models steered by SAEs #2641
Open
Sparse autoencoders (SAEs) are a popular technique for understanding what is going on inside large language models. Recently, researchers have started using them to steer model outputs by reaching directly into the "brains" of the models and editing their "thoughts" — the learned "features".
This branch makes it easy to evaluate SAE-steered models with EleutherAI's evaluation harness.
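Conceptually, a steering edit adds a scaled feature direction to the model's activations at a chosen layer: h ← h + α·d, where d is the SAE decoder row for the feature and α is the steering coefficient. Here is a minimal, dependency-free sketch of that operation; the vectors and coefficient below are made up purely for illustration:

```python
def steer(activations, feature_direction, coefficient):
    """Add coefficient * feature_direction to each activation vector."""
    return [
        [a + coefficient * d for a, d in zip(vec, feature_direction)]
        for vec in activations
    ]

# Toy activations and a toy "feature direction" (not a real SAE decoder row).
acts = [[0.5, -1.0, 2.0], [0.0, 0.25, -0.5]]
direction = [1.0, 0.0, -1.0]
steered = steer(acts, direction, coefficient=2.0)
# Each vector is shifted by 2.0 * direction:
# [[2.5, -1.0, 0.0], [2.0, 0.25, -2.5]]
```

In the actual implementation this addition happens inside the model's forward pass (e.g. via a hook on the residual stream), but the arithmetic is the same.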
The target workflow is:
1. Provide the base_name of a model that has pretrained SAEs available on sae_lens.
2. Provide the csv which defines the SAE edits. An example can be found in examples/dog_steer.csv. Neuronpedia is a helpful resource; here's the page for the feature chosen in dog_steer.csv. The columns of the csv are:
   - latent_idx: the component of the SAE which represents the feature.
   - steering_coefficient: the amount of that concept to add at that location of the model.
   - sae_release and sae_id: which sparse autoencoder to use on the model.
   - description: does not impact the code; it can contain arbitrary human-readable comments about the features.
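The CSV can be parsed with the standard library alone. A sketch of loading it into typed edit specs — the row values below are hypothetical placeholders, not the actual contents of dog_steer.csv:

```python
import csv
import io

# Hypothetical rows mirroring the documented columns of the steering CSV;
# the real values for the dog feature are listed on Neuronpedia.
csv_text = """latent_idx,steering_coefficient,sae_release,sae_id,description
12082,240.0,example-release,example/sae_id,illustrative dog feature
"""

def load_edits(f):
    """Parse the steering CSV into a list of typed edit dicts."""
    edits = []
    for row in csv.DictReader(f):
        edits.append({
            "latent_idx": int(row["latent_idx"]),
            "steering_coefficient": float(row["steering_coefficient"]),
            "sae_release": row["sae_release"],
            "sae_id": row["sae_id"],
            "description": row["description"],  # ignored by the code
        })
    return edits

edits = load_edits(io.StringIO(csv_text))
```

Each edit dict then tells the harness which SAE to load (sae_release, sae_id) and which decoder direction to add, scaled by steering_coefficient.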
```
lm_eval --model sae_steered_beta \
    --model_args base_name=google/gemma-2-2b,csv_path=/home/cs29824/matthew/lm-evaluation-harness/examples/dog_steer.csv \
    --tasks mmlu_abstract_algebra \
    --batch_size auto \
    --output_path . \
    --device cuda:0
```
The model is registered as sae_steered_beta rather than sae_steered because, although the evaluation works, some aspects are still in development.
Despite these issues, this feature will be helpful for my research, and I expect it to be valuable to others as well.