# Transcoder-circuits: reverse-engineering LLM circuits with transcoders

This repository contains tools for understanding what's going on inside large language models using transcoders. Transcoders decompose the MLP sublayers of transformer models into sparse linear combinations of interpretable features; with them, we can reverse-engineer fine-grained circuits of features within the model.
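Concretely, a transcoder is trained to imitate an MLP sublayer's input-to-output map through a wide, sparsely activating hidden layer, so that each hidden unit becomes a candidate interpretable feature. Below is a minimal PyTorch sketch of this idea; the `Transcoder` class and its parameter names are illustrative, not this repository's actual API:

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Minimal transcoder sketch: approximates an MLP sublayer's
    input -> output map as a sparse linear combination of learned
    feature directions."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # maps MLP input to feature space
        self.decoder = nn.Linear(d_features, d_model)  # reconstructs the MLP output

    def forward(self, mlp_input: torch.Tensor):
        # Non-negative feature activations; an L1 penalty during
        # training keeps them sparse.
        feature_acts = torch.relu(self.encoder(mlp_input))
        # Reconstruction of the MLP sublayer's *output* (not its input,
        # as a sparse autoencoder would reconstruct).
        mlp_output_hat = self.decoder(feature_acts)
        return mlp_output_hat, feature_acts
```

Because the decoder is linear, each active feature contributes a fixed direction to the reconstructed MLP output, which is what makes circuit-level attribution through the transcoder tractable.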

To get started, we recommend working through the `walkthrough.ipynb` notebook. The full structure of the repository is as follows:

- Installation tools:
  - `setup.sh`: A shell script for installing dependencies and downloading transcoder weights.
  - `requirements.txt`: The standard Python dependencies list.
- Examples:
  - `walkthrough.ipynb`: A walkthrough notebook demonstrating how to use the tools in this repository to reverse-engineer LLM circuits with transcoders.
  - `train_transcoder.py`: An example script for training a transcoder (see the training-loss sketch after this list).
- Experimental results:
  - `sweep.ipynb`: An evaluation of transcoders and SAEs trained on Pythia-410M (weights not provided).
  - `interp-comparison.ipynb`: Code for performing a blind comparison of SAE and transcoder feature interpretability.
- Case studies:
  - `case_study_citations.ipynb`: A reverse-engineering case study investigating a transcoder feature that activates on semicolons in parenthetical citations.
  - `case_study_caught.ipynb`: A reverse-engineering case study investigating a transcoder feature that activates on the verb "caught".
  - `case_study_local_context.ipynb`: A case study in which we attempted to reverse-engineer the circuit computing a harder-to-interpret transcoder feature. (We were less successful here, but include it in the interest of transparency.)
  - `restricted blind case studies.ipynb`: A notebook containing a set of "restricted blind case studies" that reverse-engineer random GPT2-small transcoder features without looking at MLP0 transcoder feature activations.
- Libraries:
  - `sae_training/`: Code for training and using transcoders, largely based on an older version of Joseph Bloom's excellent SAE repository -- shoutout to him! (The misnomer `sae_training` is a vestige of this origin.)
  - `transcoder_circuits/`: Code for reverse-engineering and analyzing circuits with transcoders. These are the tools used in the walkthrough notebook and in the case studies.
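For orientation, the training objective for a transcoder balances fidelity to the MLP sublayer against feature sparsity. The sketch below continues the illustrative `Transcoder` class above; the function name and the `l1_coeff` hyperparameter are hypothetical, and `train_transcoder.py` is the authoritative reference for the actual training setup:

```python
import torch

def transcoder_loss(transcoder, mlp_input, mlp_output, l1_coeff=1e-3):
    """Illustrative loss: reconstruct the MLP's true output while
    keeping only a few features active per token."""
    mlp_output_hat, feature_acts = transcoder(mlp_input)
    # Fidelity term: match the MLP sublayer's actual output.
    mse_loss = torch.mean((mlp_output_hat - mlp_output) ** 2)
    # Sparsity term: L1 penalty on feature activations.
    l1_loss = l1_coeff * feature_acts.abs().sum(dim=-1).mean()
    return mse_loss + l1_loss
```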