Language models can explain neurons in language models

Language models have become more capable and more broadly deployed, but their inner workings remain poorly understood. Interpretability research aims to uncover additional information by looking inside the model. One proposed automated approach uses GPT-4 to write and score natural language explanations of neuron behavior. The scoring methodology makes it possible to measure the technique's effectiveness and to target poorly explained parts of the network. Although the explanations currently score poorly, ML techniques tend to improve with iteration, larger models, and architecture changes. The released dataset contains (imperfect) explanations and scores for every neuron in GPT-2.
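The scoring idea can be illustrated with a minimal sketch: an explanation is judged by how well activations simulated from it match the neuron's real activations. The correlation-based scorer below is a simplified stand-in, and the activation values are mock data, not outputs of any real model.

```python
import numpy as np

def score_explanation(real_activations, simulated_activations):
    """Score an explanation as the correlation between a neuron's real
    activations and the activations simulated from the explanation.
    Higher correlation means the explanation better predicts behavior."""
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    return float(np.corrcoef(real, sim)[0, 1])

# Mock per-token activations for one neuron on a short text excerpt.
real = [0.0, 0.9, 0.1, 0.8, 0.0, 0.7]

# Hypothetical simulations: one from a good explanation, one from a poor one.
good_sim = [0.1, 1.0, 0.0, 0.9, 0.1, 0.8]
bad_sim = [0.5, 0.4, 0.6, 0.5, 0.5, 0.4]

print(score_explanation(real, good_sim))  # close to 1: explanation tracks the neuron
print(score_explanation(real, bad_sim))   # near or below 0: explanation is uninformative
```

In this toy setup the good explanation scores near 1 because its simulated activations rise and fall with the real ones, while the poor explanation does not.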