Easily evaluate models steered by SAEs #2641

Open
wants to merge 8 commits into main
Conversation

AMindToThink

Sparse Autoencoders are a popular technique for understanding what is going on inside large language models. Recently, researchers have started using them to steer model outputs by going directly into the "brains" of the models and editing their "thoughts" — called "features".
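
A minimal sketch of the idea (not code from this PR): assuming a sae_lens-style SAE object whose W_dec rows are the features' decoder directions, steering amounts to adding a scaled feature direction to the model's activations at the SAE's hook point.

    def make_steering_hook(sae, latent_idx: int, steering_coefficient: float):
        # sae.W_dec[latent_idx] is the decoder direction of the chosen feature
        # (sae_lens naming; illustrative only, not this PR's implementation)
        direction = sae.W_dec[latent_idx]

        def hook(resid, hook=None):
            # add the scaled feature direction at every token position
            return resid + steering_coefficient * direction

        return hook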

This branch makes it easy to evaluate SAE-steered models with EleutherAI's evaluation harness.

The target workflow is:

  1. The researcher does setup as normal. I included the relevant packages in the pyproject.toml.
  2. I've added a new model type, sae_steered_beta. The researcher will run lm_eval with --model sae_steered_beta.
  3. sae_steered_beta also takes two model args (passed via --model_args):
          1. base_name: the name of a base model that has pretrained SAEs available in sae_lens.
          2. csv_path: the path to a CSV file that defines the SAE edits. An example can be found in examples/dog_steer.csv. Neuronpedia is a helpful resource; the feature chosen in dog_steer.csv has a page there. The columns of the CSV are:
                • latent_idx: the index of the SAE latent (feature) to steer.
                • steering_coefficient: how much of that concept to add at that location in the model.
                • sae_release and sae_id: which sparse autoencoder from sae_lens to use on the model.
                • description: has no effect on the code; it can contain arbitrary human-readable notes about the feature.
                (An illustrative CSV sketch follows this list.)
  4. Run an lm_eval command as normal. The following example command was run from a 'results' folder.
    lm_eval --model sae_steered_beta --model_args base_name=google/gemma-2-2b,csv_path=/home/cs29824/matthew/lm-evaluation-harness/examples/dog_steer.csv --tasks mmlu_abstract_algebra --batch_size auto --output_path . --device cuda:0
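
For illustration, a CSV in the dog_steer.csv format might look like the sketch below. The latent index and coefficient are made-up placeholder values, and the release/id strings are only examples of sae_lens naming, not necessarily what examples/dog_steer.csv actually contains.

    latent_idx,steering_coefficient,sae_release,sae_id,description
    12082,240.0,gemma-scope-2b-pt-res-canonical,layer_20/width_16k/canonical,feature that fires on dog-related text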

The model is registered as sae_steered_beta rather than sae_steered because, although the evaluation works, some aspects are still in development:

  • Conditional steering is not yet implemented.
  • I'm new to both sae_lens and EleutherAI's evaluation harness, and I don't know whether I implemented everything properly.
  • Only loglikelihood works for the sae_steered_beta models.

Despite these issues, this is a feature that will be helpful for my research, and I expect it to be valuable to others as well.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


AMindToThink seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
