Easily evaluate models steered by SAEs #2641
Open
Sparse autoencoders (SAEs) are a popular technique for understanding what is going on inside large language models. Recently, researchers have started using them to steer model outputs by reaching directly into the "brains" of the models and editing their "thoughts" — the learned "features".
This branch makes it easy to evaluate SAE-steered models with EleutherAI's evaluation harness.
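Conceptually, a steering edit adds a scaled feature direction to the model's activations at a chosen layer: h ← h + α·d, where d is the SAE decoder row for the feature and α is the steering coefficient. Here is a minimal, dependency-free sketch of that operation; the vectors and coefficient below are made up purely for illustration:

```python
def steer(activations, feature_direction, coefficient):
    """Add coefficient * feature_direction to each activation vector."""
    return [
        [a + coefficient * d for a, d in zip(vec, feature_direction)]
        for vec in activations
    ]

# Toy activations and a toy "feature direction" (not a real SAE decoder row).
acts = [[0.5, -1.0, 2.0], [0.0, 0.25, -0.5]]
direction = [1.0, 0.0, -1.0]
steered = steer(acts, direction, coefficient=2.0)
# Each vector is shifted by 2.0 * direction:
# [[2.5, -1.0, 0.0], [2.0, 0.25, -2.5]]
```

In the actual implementation this addition happens inside the model's forward pass (e.g. via a hook on the residual stream), but the arithmetic is the same.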
The target workflow is:
1. Provide the base_name of a model that has pretrained SAEs available on sae_lens.
2. Provide the csv which defines the SAE edits. An example can be found in examples/dog_steer.csv. Neuronpedia is a helpful resource; here's the page for the feature chosen in dog_steer.csv. The columns of the csv are:
   - latent_idx: the component of the SAE which represents the feature.
   - steering_coefficient: the amount of that concept to add at that location of the model.
   - sae_release and sae_id: which sparse autoencoder to use on the model.
   - description: does not impact the code; it can contain arbitrary human-readable comments about the features.
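The CSV can be parsed with the standard library alone. A sketch of loading it into typed edit specs — the row values below are hypothetical placeholders, not the actual contents of dog_steer.csv:

```python
import csv
import io

# Hypothetical rows mirroring the documented columns of the steering CSV;
# the real values for the dog feature are listed on Neuronpedia.
csv_text = """latent_idx,steering_coefficient,sae_release,sae_id,description
12082,240.0,example-release,example/sae_id,illustrative dog feature
"""

def load_edits(f):
    """Parse the steering CSV into a list of typed edit dicts."""
    edits = []
    for row in csv.DictReader(f):
        edits.append({
            "latent_idx": int(row["latent_idx"]),
            "steering_coefficient": float(row["steering_coefficient"]),
            "sae_release": row["sae_release"],
            "sae_id": row["sae_id"],
            "description": row["description"],  # ignored by the code
        })
    return edits

edits = load_edits(io.StringIO(csv_text))
```

Each edit dict then tells the harness which SAE to load (sae_release, sae_id) and which decoder direction to add, scaled by steering_coefficient.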
```
lm_eval --model sae_steered_beta \
    --model_args base_name=google/gemma-2-2b,csv_path=/home/cs29824/matthew/lm-evaluation-harness/examples/dog_steer.csv \
    --tasks mmlu_abstract_algebra \
    --batch_size auto \
    --output_path . \
    --device cuda:0
```
The model is registered as sae_steered_beta rather than sae_steered because, although the evaluation works, some aspects are still in development.
Despite these issues, this feature will be helpful for my research, and I expect it to be valuable to others as well.