Researchers Map How Model Circuits Hold Knowledge
Attacks
Researchers probe how factual knowledge is organised inside model 'circuits' and why that matters for the security and maintenance of Large Language Models (LLMs). The work examines circuits extracted from a GPT-2 model and performs controlled knowledge edits to see which knowledge holds firm and which yields when edited.
What they did
The team converts automatically extracted circuits into generative tasks and trains a differentiable mask that isolates circuit parameters. They attempt targeted edits using a curated set of 198 facts (Hierarchy Data) and other benchmarks, including linguistic tests such as BLiMP. Experiments compare small circuits (about 5% of parameters) to larger ones (about 50%) and to the full model.
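A minimal sketch of what such a masked edit might look like, assuming a Hugging Face GPT-2 and a precomputed binary circuit mask; the file name, the example fact and the hyperparameters are illustrative, not the authors' code. Parameters outside the circuit are zeroed and kept frozen, then a single counterfactual fact is fine-tuned into what remains:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative sketch, not the authors' pipeline. A precomputed binary mask
# (hypothetical file) marks circuit parameters; everything outside the circuit
# is zeroed, then a single fact edit is attempted by fine-tuning what remains.
model = GPT2LMHeadModel.from_pretrained("gpt2")   # LM head rather than a classifier head
tok = GPT2TokenizerFast.from_pretrained("gpt2")
mask = torch.load("circuit_mask.pt")              # assumed {param_name: 0/1 tensor}

with torch.no_grad():
    for name, p in model.named_parameters():
        if name in mask:
            p.mul_(mask[name].to(p.dtype))        # zero parameters outside the circuit

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
batch = tok("Paris is located in Germany", return_tensors="pt")  # counterfactual edit target
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
with torch.no_grad():
    for name, p in model.named_parameters():      # keep non-circuit parameters frozen
        if name in mask and p.grad is not None:
            p.grad.mul_(mask[name].to(p.grad.dtype))
opt.step()
print(f"edit loss: {loss.item():.3f}")            # lower negative log probability = stronger edit
```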
What they found
Circuits do carry structured knowledge: smaller, sparse circuits often resist edits more than their complementary parts, suggesting compact storage of some facts. LayerNorm parameters feature heavily in those tight circuits, challenging the assumption that attention or MLP layers alone store knowledge. Complementary subnetworks prove easier to edit, while the intact model still integrates edits best when all parts interact.
The practical upshot is double-edged. On the defensive side, mapping circuits offers a route to integrity monitoring and surgical bias mitigation by targeting complementary components without changing core behaviour. On the offensive side, the same understanding could make it easier to inject targeted misbehaviour or backdoors if attackers learn which subnetworks to alter.
Limitations matter. The work uses GPT-2 and a specific editing method, so results may not generalise to much larger models or other architectures. The exact size of an 'ideal' circuit remains exploratory.
For practitioners this yields pragmatic steps: maintain versioned model snapshots, monitor parameter-level deviations, and invest in interpretability tooling that highlights LayerNorm and complementary subnetworks. Research should prioritise replication on larger models, standardised integrity checks, and safe editing protocols that preserve audit trails.
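For the parameter-level monitoring step, a minimal sketch is shown below, assuming model snapshots saved as PyTorch state dicts; the paths and the LayerNorm name heuristic are illustrative. It reports the modules whose weights have drifted most between a trusted snapshot and the deployed checkpoint:

```python
import torch

# Integrity-check sketch: compare a deployed checkpoint against a trusted
# versioned snapshot and report the largest parameter-level deviations,
# tagging LayerNorm separately since the paper flags it as knowledge-heavy.
def parameter_deviations(snapshot_path: str, current_path: str, top_k: int = 10):
    ref = torch.load(snapshot_path, map_location="cpu")
    cur = torch.load(current_path, map_location="cpu")
    deltas = {}
    for name, ref_tensor in ref.items():
        if name in cur:
            deltas[name] = (cur[name].float() - ref_tensor.float()).norm().item()
    suspicious = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    for name, delta in suspicious:
        tag = "LayerNorm" if "ln" in name.lower() else "other"
        print(f"{delta:10.4f}  [{tag}]  {name}")
    return suspicious
```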
The findings cut through both hype and alarm: circuits create real attack and defence surfaces, but they also offer practical levers for safer, more verifiable model updates.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Understanding Language Model Circuits through Knowledge Editing
🔍 ShortSpan Analysis of the Paper
Problem
This work investigates how knowledge is structured within circuits (compact subnetworks extracted from a GPT-2 language model that reproduce the full model's behaviour) and how targeted edits to that knowledge affect model outputs. Understanding how knowledge is organised in these circuits matters for interpretability and safety, including how models might be tampered with or defended against tampering. The authors aim to characterise knowledge storage, its distribution across network components, and the architectural composition of knowledge-bearing circuits to shed light on what these circuits mean for model reliability and safety research.
Approach
The study uses automatically extracted circuits from GPT-2 and converts them to generation tasks by replacing the classification head with a language model head, enabling knowledge edits within circuits. A differentiable masking technique assigns each parameter a mask m_θ in the range [0,1], trained with a faithfulness loss that aligns the circuit's outputs with the original model and a sparsity loss implemented via a hard concrete distribution and a straight-through estimator. Edits target the knowledge embedded in circuits through fine-tuning, while parameters outside the circuit are masked to zero.

The knowledge editing data come from Hierarchy Data, a set of 198 facts formatted for editing analysis, alongside a suite of datasets spanning hierarchical concept knowledge, linguistic knowledge and safety behaviours, including eight BLiMP datasets and five safety datasets. Experiments focus on circuits comprising 50 per cent and 5 per cent of model parameters, with additional analyses at 15 per cent, 25 per cent and 35 per cent in supplementary material.

Performance is assessed using the negative log probability of the edited target (lower is better), as well as pre-edit negative log probabilities of true targets to gauge baseline knowledge, with comparisons against complementary circuits and the full model. Datasets are processed on GPUs, and the approach relies on a fine-tuning objective that preserves task alignment while injecting new knowledge. The authors also compare circuits across tasks to explore overlap and the distribution of knowledge across layers, particularly noting LayerNorm contributions.
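As a rough illustration of the masking objective, the sketch below uses one common hard concrete gate parametrisation (after Louizos et al.) with a straight-through binarisation, and assumes a KL divergence as the faithfulness term; the paper only states that the loss aligns circuit outputs with the original model, so the exact form, constants and function names here are assumptions:

```python
import torch
import torch.nn.functional as F

# Hard concrete gate constants from one common parametrisation
# (assumed here, not taken from the paper).
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sample_mask(log_alpha: torch.Tensor) -> torch.Tensor:
    """Sample a relaxed mask in [0, 1]; binarise with a straight-through pass."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / BETA)
    z = (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)
    z_hard = (z > 0.5).float()
    return z_hard + (z - z.detach())  # hard forward pass, soft gradient

def expected_active(log_alpha: torch.Tensor) -> torch.Tensor:
    """Differentiable sparsity penalty: expected number of open gates."""
    return torch.sigmoid(
        log_alpha - BETA * torch.log(torch.tensor(-GAMMA / ZETA))
    ).sum()

def mask_loss(circuit_logits, full_logits, log_alpha, lam=1e-4):
    """Faithfulness to the full model plus a sparsity term (KL is an assumption)."""
    faithfulness = F.kl_div(
        F.log_softmax(circuit_logits, dim=-1),
        F.softmax(full_logits, dim=-1),
        reduction="batchmean",
    )
    return faithfulness + lam * expected_active(log_alpha)
```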
Key Findings
- Editing reveals an asymmetry: knowledge-intensive circuits show stronger resistance to edits than their complementary parts, indicating structured information storage within circuits.
- The ideal knowledge-bearing circuit appears to be neither the whole model nor a single small subset; evidence suggests a size between 5 per cent and 50 per cent of parameters, with 5 per cent circuits often showing greater resistance to edits and possibly lying closer to an ideal circuit, while 50 per cent circuits may include non-knowledge-bearing noise.
- Pre-edit measurements show that hierarchy-related circuits store knowledge more effectively than their complementary circuits, with lower negative log probabilities for true targets; this pattern generalises across different circuit types and tasks, including a linguistic dataset and a syntax-based task, implying spillover effects between circuits.
- Post-edit analysis indicates that complementary circuits are easier to edit than primary circuits, and the full model still yields the best edit performance when all subnetworks interact, suggesting synergy between circuits and their complements during knowledge injection.
- Across datasets, 5 per cent circuits display overlaps with other tasks that exceed random expectations, typically 13 to 18 per cent, while 50 per cent circuits show even higher cross-task overlap; some datasets, such as risk aversion, show broader overlap, whereas extraversion overlaps are smaller, pointing to dataset-dependent organisation of knowledge within circuits.
- LayerNorm dominates circuit parameter share, especially in 5 per cent circuits, while attention and MLP layers contribute less per circuit but remain important; this challenges assumptions about where knowledge is stored and highlights normalisation layers as potentially central to knowledge organisation and circuit stability (a breakdown sketch follows this list).
- There is evidence of knowledge spillover between syntax-based circuits and knowledge-based problems, indicating that circuits solving one type of task may also participate in storing knowledge relevant to others, informing interpretability and safety considerations.
- Complementary circuits offer a potential strategy for bias mitigation: editing complementary circuits without modifying primary behavioural circuits could reduce undesired behaviours while preserving performance, subject to further validation.
- The study provides open-source data and code, enabling replication and extension to other models and editing methods, and outlines future work on ideal circuit identification, complementary circuit editing, and broader dataset coverage for safety and reliability improvements.
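As referenced in the LayerNorm finding above, a module-type breakdown of a circuit can be computed directly from a binary mask over named parameters. The sketch below assumes Hugging Face GPT-2 parameter naming (ln_*, attn, mlp) and a hypothetical mask file, so the grouping heuristics are illustrative rather than the authors' analysis code:

```python
import torch

# Compute the share of circuit parameters held by LayerNorm, attention and MLP
# modules, given a binary mask over named GPT-2 parameters.
def circuit_composition(mask_path: str) -> dict:
    mask = torch.load(mask_path)  # assumed {param_name: 0/1 tensor}
    counts = {"layernorm": 0, "attention": 0, "mlp": 0, "other": 0}
    for name, m in mask.items():
        n = int(m.sum().item())
        if ".ln_" in name:            # ln_1, ln_2 and the final ln_f
            counts["layernorm"] += n
        elif ".attn." in name:
            counts["attention"] += n
        elif ".mlp." in name:
            counts["mlp"] += n
        else:
            counts["other"] += n
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}
```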
Limitations
The work focuses on the GPT-2 base model and a generation-oriented revision of circuits, using a specific set of datasets and a particular editing framework. It relies on a parameter-level circuit analysis and may not generalise to larger LLMs or to other model architectures or editing techniques. The precise size, and even the existence, of an ideal circuit remain exploratory, and further work is needed to confirm generalisability across tasks, models and real-world deployment scenarios.
Why It Matters
Mapping and editing the information stored in model circuits could enable more robust integrity monitoring and safer knowledge updates in deployed language models. Understanding circuit structure informs how edits might be detected or mitigated and suggests strategies such as targeting complementary circuits to reduce biases without impairing core behaviour. The findings advance interpretability by clarifying how information is distributed and stabilised within circuits, with direct relevance to AI safety, bias mitigation, and the prevention of targeted misbehaviour or backdoors. The work also highlights practical considerations for safeguarding knowledge in LLMs and informs future research aimed at safer, verifiable model updates.