Transformer Circuits Thread

When Models Manipulate Manifolds: The Geometry of a Counting Task

Authors: Wes Gurnee*, Emmanuel Ameisen*, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson*‡
Affiliation: Anthropic
Published: October 21st, 2025
* Core Research Contributor; ‡ Correspondence to joshb@anthropic.com

Contents: Introduction · Representing Character Count · Sensing the Line Boundary · Predicting the Newline · A Distributed Character Counting Algorithm · Visual Illusions · Related Work · Discussion

Introduction

Intelligent systems need perception to understand, predict, and navigate their environment. These sensory capabilities reflect what's useful for survival in a specific environment: bats use echolocation, migratory birds sense magnetic fields, Arctic reindeer shift their UV vision seasonally. But when your world is made of text, what do you see?

Language models encounter many text-based tasks that benefit from visual or spatial reasoning: parsing ASCII art, interpreting tables, or handling text-wrapping constraints. Yet their only "sensory" input is a sequence of integers representing tokens. They must learn perceptual abilities from scratch, developing specialized mechanisms in the process.

In this work, we investigate the mechanisms that enable Claude 3.5 Haiku to perform a natural perceptual task which is common in pretraining corpora and involves tracking position in a document. We find learned representations of position that are in some ways quite similar to the biological neurons found in mammals that perform analogous tasks ("place cells" and "boundary cells" in mice), but in other ways unique to the constraints of the residual stream in language models. We study these representations and find dual interpretations: we can understand them as a family of discrete features or as a one-dimensional "feature manifold"/"multidimensional feature" (Olah & Batson 2023; Olah 2024; Gorton 2024; Engels et al. 2025) [1, 2, 3, 4].^1

^1 All features have a magnitude dimension; so a discrete feature is a one-dimensional ray, and a one-dimensional feature manifold is the set of all scalings of that manifold, contracting to the origin. See "What is a Linear Representation? What is a Multidimensional Feature?" [2].

In the first interpretation, position is determined by which features activate and how strongly; in the latter interpretation, it's determined by angular movement on the feature manifold. Similarly, computation has two dual interpretations, as discrete circuits or geometric transformations.

The task we study is linebreaking in fixed-width text. When training on source code, chat logs, email archives, scanned articles, or judicial rulings that have line-width constraints, how does the model learn to predict when to break a line?^2

^2 Michaud et al. looked for "quanta" of model skills by clustering gradients (The Quantization Model of Neural Scaling, NeurIPS 2023) [5].
Their Figure 1 shows that predicting newlines in fixed-width text formed one of the top 400 clusters for the smallest model in the Pythia family, with 70M parameters.

Human visual perception lets us do this almost completely subconsciously – when writing a birthday card, you can see when you are out of room on a line and need to begin the next – but language models see just a list of integers. In order to correctly predict the next token, in addition to selecting the next word, the model must somehow count the characters in the current line, subtract that from the line width constraint of the document, and compare the number of characters remaining to the length of the next word.

As a concrete example, consider the below pair of prompts with an implicit 50-character line-wrapping constraint.^3

^3 The wrapping constraint is implicit. Each newline gives a lower bound (the previous word did fit) and an upper bound (the next word did not). We do not nail down the extent to which the model performs optimal inference with respect to those constraints, focusing instead on how it approximately uses the length of each preceding line to determine whether to break the next. There are also many edge cases for handling tokenization and punctuation. A model could even attempt to infer whether the source document used a non-monospace font and then use the pixel count rather than the character count as a predictive signal!

When the next word fits, the model says it; when it does not, the model breaks the line:

[Figure: the two "aluminum" prompts and the model's completions.]

To orient ourselves to the stages of the computation, we first studied the model using discrete dictionary features. In this frame, we can understand computation as an "attribution graph" (Ameisen et al., Circuit Tracing: Revealing Computational Graphs in Language Models, Transformer Circuits, 2025) [6], in which a cascade of features excite or inhibit one another.^4

^4 We actually first tried to use patching and probing without looking at the graph, as a kind of methodological test of the utility of features, but did not make much progress. In hindsight, we were training probes for quantities different from the ones the model represents cleanly, e.g., a fusion of the current token position and the line width.

[Figure: Attribution graph for Claude 3.5 Haiku's prediction of a newline in the aluminum prompt. We see features relating to "width of the previous line" and "position in the current line", which together activate features for "distance from line limit". Combined with features for the planned next word, these activate "predict newline" features.]

The attribution graph shows how the model performs this task by combining features that represent the different concepts it needs to track:

1. Features for the current position in the line (the character count), as well as features for the total line width (the constraint), are computed by accumulating features for individual token lengths.
2. The model then combines these two representations — current position and line width — to estimate the distance from the end of the line, leading to "characters remaining" features.

3. Finally, the model uses this estimate of characters remaining, along with features for the planned next word, to determine whether the next word will fit on the line.

The attribution graph provides a kind of execution trace of the algorithm, showing which variables are computed on this prompt and from what. After finding large feature families involved in representing these quantities across a diverse dataset, we suspected a simpler lens might be provided in terms of lower-dimensional feature manifolds interacting geometrically. We found geometric perspectives on the following questions:

[Figure: Key steps in the linebreaking behavior can be described in terms of the construction and manipulation of manifolds.]

How does the model represent different counts? The number of characters in a token, the number of characters in the current line, the overall line width constraint, and the number of characters remaining in the current line are each represented on one-dimensional feature manifolds embedded with high curvature in low-dimensional subspaces of the residual stream. These manifolds have a dual interpretation in terms of discrete features, which tile the manifold in a canonical way, providing approximate local coordinates. Manifolds with similar geometry arise for a variety of ordinal concepts, and a ringing pattern we see in the embedded geometry in all these cases is optimal with respect to a simple physical model (§Representing Character Count).^5

^5 Ringing, in the manifold perspective, corresponds to interference in the feature-superposition perspective.

How does the model detect the boundary? To detect an approaching line boundary, the model must compare two quantities: the current character count and the line width. We find attention heads whose QK matrix rotates one counting manifold to align it with the other at a specific offset, creating a large inner product when the difference of the counts falls within a target range. Multiple heads with different offsets work together to precisely estimate the characters remaining (§Sensing the Line Boundary).

How does the model know if the next word fits? The final decision — whether to predict a newline — requires combining the estimate of characters remaining with the length of the predicted next word. We discover that the model positions these counts on near-orthogonal subspaces, creating a geometric structure where the correct linebreak prediction is linearly separable (§Predicting the Newline).

How does the model construct these curved geometries? The curvature in the character count representation manifold is produced by many attention heads working together, each contributing a piece of the overall curvature. This distributed algorithm is necessary because individual components cannot generate sufficient output variance to create the full representation (§A Distributed Character Counting Algorithm).

We validate these interpretations through targeted interventions, ablations, and "visual illusions" — character sequences that hijack specific attention mechanisms to disrupt spatial perception (§Visual Illusions).

Zooming out, we take several broader lessons from this mechanistic case study:
When Models Manipulate Manifolds. For representing a scalar quantity (e.g., integer counts from $1$ to $N$), it is inefficient to use $N$ orthogonal dimensions, and not expressive enough to use just one.^6 Instead, models learn to represent these quantities on a feature manifold with intrinsic dimension 1 (the count) embedded in a subspace with extrinsic dimension $1 < d \ll N$ (e.g., Gorton 2024; Engels et al. 2025; Modell et al. 2025) [3, 4, 7], in which the curve "ripples". Such rippled manifolds optimally trade off capacity constraints (roughly, dimensionality) against maintaining the distinguishability of different scalar values (curvature). Our work demonstrates the intricate ways in which these manifolds can be manipulated to perform computation, and shows how this can require distributing computation across multiple model components.

^6 Orthogonal dimensions would also not be robust to estimation noise.

Duality of Features and Geometry. Dictionary features provide an unsupervised entry point for discovering mechanisms, and attribution graphs surface the important features for any particular prediction. Sometimes, discrete features (and their interactions) can be equivalently described using continuous feature manifolds (and their transformations). In cases where it is possible to explicitly parameterize the manifold (as with the various integer counts we study), we can directly study the geometry, making some operations clearer (e.g., boundary detection). But this approach is expensive in researcher time and potentially limited in scope: it is straightforward when studying known continuous variables but becomes difficult to execute correctly for more complex, difficult-to-parametrize concepts.

Complexity Tax. While unsupervised discovery is a victory in and of itself, dictionary features fragment the model into a multitude of small pieces and interactions – a kind of complexity tax on the interpretation. In cases where a manifold parametrization exists, we can think of the geometric description as reducing this tax. In other cases, we will need additional tools to reduce the interpretation burden, like hierarchical representations (Costa et al. 2025) [8] or macroscopic structure in the global weights (Olah, Interpretability Dreams, 2023) [9].
We would be excited to see methods that extend the dictionary learning paradigm to unsupervised discovery of other kinds of geometric structures (e.g., those found in prior work: Hewitt & Manning 2019; Coenen et al. 2019; Chang et al. 2022; Wattenberg & Viégas 2024; Park et al. 2024; Li et al. 2025; Wollschläger et al. 2025; Hindupur et al. 2025; Modell et al. 2025) [10, 11, 12, 13, 14, 15, 16, 17, 7].

Natural Tasks. The crispness of the representations and circuits we found was quite striking, and may be due to how well the model does the task. Linebreaking is an extremely natural behavior for a pretrained language model, and even tiny models are capable of it given enough context. Studying tasks which are natural for pretrained language models, instead of those of more theoretical interest to human investigators, may offer promising targets for finding general mechanisms.

Preliminaries

To enable systematic analysis, we created a synthetic dataset using a text corpus of diverse prose where we (1) stripped out all newlines and (2) reinserted newlines every $k$ characters, breaking at the nearest word boundary $\leq k$, for $k = 15, 20, \ldots, 150$.
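For concreteness, here is a minimal sketch of such a greedy wrapping procedure (illustrative only: the exact preprocessing, tokenization, and edge-case handling used to build the dataset may differ):

```python
def wrap_text(text: str, k: int) -> str:
    """Strip existing newlines, then re-wrap `text` so that each line holds
    as many whole words as possible without exceeding k characters."""
    words = text.split()          # also removes the original newlines
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) <= k:   # the next word still fits on this line
            current = candidate
        else:                     # it would overflow: break the line here
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return "\n".join(lines)

# Example: the Gettysburg Address opening wrapped to k = 40 characters.
opening = ("Four score and seven years ago our fathers brought forth on this "
           "continent, a new nation, conceived in Liberty, and dedicated to "
           "the proposition that all men are created equal.")
print(wrap_text(opening, k=40))
```

The model's task is to predict where this procedure would place the newlines, given only the wrapped prefix.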
As an example, here is the opening sentence of the Gettysburg Address, wrapped to $k=40$ characters, with the newlines shown explicitly:

Four score and seven years ago our⏎
fathers brought forth on this continent,⏎
a new nation, conceived in Liberty, and⏎
dedicated to the proposition that all⏎
men are created equal.

Claude 3.5 Haiku is able to adapt to the line length for every value of $k$, predicting newlines at the correct positions with high probability by the third line (see Appendix). All features in the main text of this paper are from a 10-million-feature Weakly Causal Crosscoder (WCC) dictionary (Lindsey et al. 2024) [18] trained on Claude 3.5 Haiku. Feature activation values are normalized to their max throughout.

Representing Character Count

We define the line character count (or character count) at a given token in a prompt to be the total number of characters since the last newline, including the characters of the current token.

A natural thing to check is whether the model linearly represents the character count as a quantitative variable: that is, can we predict character count with high accuracy via linear regression on the residual stream? Yes: a linear probe fit on the residual stream after layer 1 has an $R^2$ of 0.985. This success does not mean, however, that the model actually represents the character count along a single line. Instead, we find a multidimensional representation of the character count that we will analyze from four perspectives:

1. Sparse crosscoder features.^7 Each feature has an encoder, which acts as a linear + (Jump)ReLU probe on the residual stream, and a decoder. Ten features $f_1, \ldots, f_{10}$ are associated with line character count. The model's estimate of the character count, given a residual stream vector $x$, is summarized by the set of activities of the 10 features $\{f_i(x)\}$.

2. A low-dimensional subspace.^8 The model's estimate of the character count is summarized by the projection $\pi(x)$ of $x$ onto that subspace. Two datapoints have similar character counts if their projections are close in that subspace.

3. A continuous one-dimensional manifold contained in that low-dimensional subspace.^9
The model's estimate of the character count is summarized by the nearest point on the manifold to the projection of $x$ into the subspace, and its confidence in that estimate by the magnitude of $\pi(x)$.

4. A set of 150 logistic probes (corresponding to values of line character count from 1 to 150).^10 The model's estimate of the character count is summarized by the probability distribution given by the softmax of the probe activities, $\mathrm{softmax}(Px)$.

Each of these perspectives provides a complementary view of the same underlying object. The feature perspective is valuable for getting oriented, the subspace is perfect for causal intervention, the manifold is helpful for understanding how the representation is constructed and then manipulated to detect boundaries, and the logistic probes are useful for analyzing the OV and QK matrices of the individual attention heads involved.

Character Count Features

We begin with the features. In layers one and two, we found features that seemed to activate based on a token's character position within a line. For example, in the attribution graph for the aluminum prompt, there were two features active on the final word "called" that seemed to fire when the line character count was between 35–55 and 45–65, respectively. To find more such features, we computed the mean activation of each feature binned by line character count. There were ten features with smooth profiles and large between-character-count variance, shown below:

[Figure: A family of features representing the current character count in a line of text. The width of the features' tuning curves increases at larger line character counts.]

We find these features especially interesting, as they are quite analogous to curve-detector features in vision models (Cammarata et al. 2021; Gorton 2024) [19, 20] and place cells in biological brains (Moser et al. 2008) [21]. In all three of these cases, a continuous variable is represented by a collection of discrete elements that activate for particular value ranges. Moreover, we also observe dilation of the receptive fields (i.e., subsequent features activate over increasingly large character ranges), which is a common characteristic of biological perception of numbers (e.g., Dehaene 2003; Piazza et al. 2004) [22, 23].
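A minimal sketch of the binned tuning-curve analysis (illustrative only; the array names and file paths are placeholders for precomputed feature activations and per-token character counts):

```python
import numpy as np

# Hypothetical inputs: activations of F dictionary features on T tokens,
# and the line character count of each token.
acts = np.load("feature_acts.npy")        # shape (T, F)
char_count = np.load("char_count.npy")    # shape (T,), values 1..150

# Mean activation of every feature, binned by line character count
# (assumes every count value 1..150 appears in the data).
counts = np.arange(1, 151)
tuning = np.stack([acts[char_count == c].mean(axis=0) for c in counts])  # (150, F)

# Rank features by how much their mean activation varies across counts;
# smooth, high-variance profiles are the "character count" candidates.
between_count_var = tuning.var(axis=0)
top_features = np.argsort(between_count_var)[::-1][:10]
print("candidate character-count features:", top_features)
```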
In the Appendix, we show these features are universal across dictionaries of different sizes, but that some feature splitting occurs with respect to the line width constraint.

The Model Represents Character Count on a Continuous Manifold

We observe that character count feature activations rise and fall at an offset, with two features being active at a time for most counts. This pattern suggests that the features are reconstructing a curved continuous manifold, locally parametrized by the activity of the two most active features. Given that their joint activation profiles follow a sinusoidal pattern, we expect reconstructions to lie on a curve between adjacent feature decoders.

To visualize this, we first compute the average layer 2 residual stream for each value of line character count on our synthetic dataset. We compute the PCA of these 150 vectors, and find that the top 6 components capture 95% of the variance; we project data onto that 6-dimensional subspace, which we call the "character count subspace" (top 3 PCs on the left below, next 3 PCs on the right). We observe the data form a twisting curve, resembling a helix from the perspective of PCs 1–3 and a more complex twist from the perspective of PCs 4–6.

We also reconstruct the residual stream for each datapoint using only the 10 character count features identified above, and compute the average reconstructed residual stream. We project the resulting curve, along with the feature decoders, into the same subspace. We find that the average line character count vectors are quite closely approximated by the feature reconstruction, though with mild kinks near the feature vectors themselves, reminiscent of a spline approximation of a smooth curve. While the 10 feature vectors discretize the curve, interpolating between the 2–3 neighboring features which are active at a time allows for a high-quality reconstruction of 150 data points.

[Figure: Character count is represented on a manifold in a 6-dimensional subspace (jagged line). This manifold can be approximately locally parametrized by the features we identified (crosses). Panels: first 3 PCs and next 3 PCs; legend: average data, average reconstructed data, feature decoders.]

Validation: The Character Count Subspace is Causal

To validate our interpretation of the character count subspace, we perform a coarse-grained ablation and a fine-grained intervention.

Ablation Experiment. For our ablation experiment, we zero-ablate (from a single early layer) a $k$-dimensional subspace corresponding to the top $k$ principal components of the per-character-count mean activations, and compare to a baseline of ablating a random $k$-dimensional subspace. Below we measure the loss effect, broken down by newline and non-newline tokens.^11

^11 Note that in general one should not assume that a subspace spanned by features (or a PCA) is dedicated to those features, because it could be in superposition with many other features. However, because the character count subspace is densely active (and therefore less amenable to being in superposition), this experimental design is more justified here.
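A sketch of how the character count subspace and the zero-ablation can be computed from precomputed activations (illustrative only; writing the ablated activation back into the forward pass requires a model hook, which is left abstract here):

```python
import numpy as np

resid = np.load("layer2_resid.npy")       # (T, d_model) residual stream at an early layer
char_count = np.load("char_count.npy")    # (T,) line character count per token

# Mean residual-stream vector for each character count value.
counts = np.arange(1, 151)
means = np.stack([resid[char_count == c].mean(axis=0) for c in counts])  # (150, d_model)

# PCA of the 150 mean vectors; the top components span the "character count subspace".
centered = means - means.mean(axis=0)
_, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 6
U = Vt[:k]                                # (k, d_model) orthonormal basis of the subspace
print("variance explained:", (S[:k] ** 2).sum() / (S ** 2).sum())

# Zero-ablation: remove the component of each activation lying in that subspace.
# (In the real experiment the edited activation is written back into the model's
# forward pass at the chosen layer; here we only show the linear algebra.)
def ablate_subspace(x: np.ndarray, basis: np.ndarray) -> np.ndarray:
    return x - (x @ basis.T) @ basis

resid_ablated = ablate_subspace(resid, U)
```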
[Figure: Ablating the character count subspace has a large effect only when the next token is a newline.]

Intervention Experiment. As a more surgical intervention, we perform an experiment to modify the perceived character count at the end of the aluminum prompt (originally 42 characters). Specifically, we sweep over character counts $c$ and substitute the mean activation across all tokens in our dataset with count $c$. That is, $a_{\text{patched}} = a_{\text{original}} - \mu_{\text{original}} + \mu_{c}$ for activation $a$ and average activation matrix $\mu$. We perform this intervention at three adjacent early layers and on the last two tokens, both with the entire mean vector and within the 6-dimensional PCA space of the mean vectors.^12

^12 The attribution graph has several positional features and edges on both the last token ("called") and the second-to-last token ("also"). We change the "also" count representation to be 6 characters prior to that of the final token, to maintain consistency.

[Figure: Intervening on a rank-6 subspace is sufficient to change the model's linebreaking behavior.]

The Probe Perspective

We also train supervised logistic regression probes to predict character count,^13 as a 150-way multiclass classification problem. Probes trained after layer 1 achieve a root mean squared error of 5, indicating some intrinsic noise in the character count representation — which is consistent with our features having relatively wide receptive fields. Performing PCA on the 150 probe weight vectors, we find that 6 components capture 82% of the variance.
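A sketch of the probe-training setup, treating character count prediction as 150-way classification (scikit-learn is used here for illustration; the actual training details are not specified above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

resid = np.load("layer1_resid.npy")       # (T, d_model) residual stream after layer 1
char_count = np.load("char_count.npy")    # (T,) integer labels in 1..150

# One multinomial logistic regression = 150 linear probes sharing a softmax.
probe = LogisticRegression(max_iter=1000)
probe.fit(resid, char_count)

# RMSE of the argmax prediction, treating the class label as a count.
pred = probe.predict(resid)
rmse = np.sqrt(np.mean((pred - char_count) ** 2))
print(f"RMSE of argmax prediction: {rmse:.1f} characters")

# probe.coef_ has shape (150, d_model): one weight vector per count value.
# A PCA of these rows gives the low-dimensional probe subspace discussed above,
# and probe.decision_function(resid) gives the per-count "response curves".
```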
When we look at the average responses of each probe to tokens with different line character counts, we see a striking pattern. In addition to a diagonal band (probes, like the sparse features, have increasingly wide receptive fields), we see two faint off-diagonal bands on each side! The response curve of each probe is not monotonically decreasing away from its max, but rebounds. This "ringing" turns out to be a natural consequence of embedding a "rippled" manifold into low dimensions.

[Figure: Response curves of the line character count probes as a function of line character count show widening receptive fields and a "ringing" pattern of off-diagonal stripes.]

Rippled Representations are Optimal

We note that the cosine similarities of the mean activation vectors (which form the helix-like curve visualized in PCA space above), the linear probe vectors, and the feature decoder vectors all exhibit ringing patterns similar to the figure above.^14 Note that not only are neighboring features not orthogonal; features further away have negative similarities, and those even further away have positive ones again.

^14 We use the term "ringing" in the sense of signal processing: a transient oscillation in response to a sharp peak, as in the Gibbs phenomenon.

[Figure: cosine similarity matrices exhibiting ringing.]

This structure turns out to be a natural consequence of having the desired pattern of similarity, trivially achievable in 150 dimensions, projected down to low dimensions. As a toy model of this, suppose that we wish to have a discretized circle's worth of unit vectors, each similar to its neighbors but orthogonal to those further away. This can be realized by a symmetric set of unit vectors in 150 dimensions with the cosine similarity matrix $X$ pictured below (left). Projecting this to its top 5 eigenvectors yields a 5-dimensional embedding of the same vectors whose cosine similarity matrix (below right) exhibits ringing. We also plot the curve these vectors form in the top 3 eigenvectors.
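The toy model is straightforward to reproduce. The following sketch builds the banded "similar to neighbors, orthogonal to the rest" similarity matrix for points on a discretized circle and compares it with its best rank-5 approximation (illustrative code):

```python
import numpy as np

n, width = 150, 6                         # 150 points on a circle; "similar" zone of +/- 6

# Ideal similarity: 1 on the diagonal, tapering to 0 within the zone, 0 elsewhere
# (circular distance, so the matrix is circulant and symmetric).
idx = np.arange(n)
dist = np.minimum(np.abs(idx[:, None] - idx[None, :]),
                  n - np.abs(idx[:, None] - idx[None, :]))
X = np.clip(1.0 - dist / width, 0.0, None)

# Best 5-dimensional embedding: keep the top 5 eigenvectors of X.
eigvals, eigvecs = np.linalg.eigh(X)
top = np.argsort(eigvals)[::-1][:5]
emb = eigvecs[:, top] * np.sqrt(eigvals[top])     # (150, 5) embedding of the circle

# The top eigenvectors of a circulant kernel are Fourier modes, so `emb` places
# the 150 points on a closed curve built from a small number of frequencies.
emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
low_rank_sim = emb_unit @ emb_unit.T
print(low_rank_sim[0, [0, 3, 12, 25, 40]].round(2))
```

Plotting `low_rank_sim` as a heatmap reproduces the off-diagonal stripes: moderately distant points acquire negative similarity, and points further out become slightly positive again.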
We can think of the original 150-dimensional embedding of the circle as being highly curved, and the resulting 5-dimensional embedding as retaining as much of that curvature as possible. This manifests as ripples in the embedding of the circle when viewed in a 3D projection. A relationship of this construction to Fourier features is discussed in the appendix.

[Figure: Left panel shows an ideal similarity matrix for vectors representing points along a circle. Middle panel shows the optimal (PCA) approximation possible when embedding the points in 5 dimensions. Right panel shows the resulting projection of the circle to the top 3 dimensions, exhibiting rippling.]

Alternatively, one can view the ringing from the perspective of sparse feature decoders as a kind of interference weight (Olah, Turner & Conerly 2025) [24]. With no capacity constraints, the model might use orthogonal vectors to represent the quantitative response of each feature, with its own receptive field, to the input data. Forced into lower-dimensional superposition, the similarity matrix picks up both a wider diagonal stripe and the upper/lower off-diagonal ringing stripes.

Finally, we also construct a simple physical model showing that the rippling and ringing arise even when the solution is found dynamically, whenever many vectors are packed into a small number of dimensions. Below, we show the result of a simulation in which 100 points confined to a 6-dimensional hypersphere are subjected to attractive forces toward their 6 closest neighbors on each side (matching the RMSE of our probes) and repulsive forces from all other points. (To avoid boundary conditions, we use the topology of a circle instead of an interval.) On the right below is a heatmap exhibiting two rings, and on the left is a 3-dimensional projection of the 6-dimensional curve. This simulation is interactive, and the reader is encouraged to experiment with reinitializing the points, switching the ambient dimension, and modifying the width of the attractive zone. Decreasing the attractive zone or increasing the embedding dimension both increase curvature (and the amount of ringing), and vice versa.^15 As the number of points on the curve grows and the attractive zone width shrinks (in relative terms), the curvature grows quite extreme, approaching a space-filling curve in the limit.

^15 The simulation can sometimes find itself in local minima. Increasing the width of the attractive zone before decreasing it again usually solves this issue.

[Interactive figure: particle dynamics on an n-dimensional sphere (dimensions 3–8, adjustable zone width, circle or interval topology). Particles attract their neighbors and repel distant points; panels show a 3D projection of the points and their inner product matrix.]
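For readers without the interactive version, here is a rough sketch of this kind of simulation (a simplification with plain gradient steps, unit-sphere renormalization, and hand-picked force constants, which may differ from the interactive widget):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, zone, steps, lr = 100, 6, 6, 3000, 0.05

# Random initialization on the unit (d-1)-sphere.
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Circular index distance between points i and j (topology of a circle).
idx = np.arange(n)
ring_dist = np.minimum(np.abs(idx[:, None] - idx[None, :]),
                       n - np.abs(idx[:, None] - idx[None, :]))
attract = (ring_dist > 0) & (ring_dist <= zone)   # neighbors within the attractive zone
repel = ring_dist > zone                          # everyone else repels

for _ in range(steps):
    diff = x[None, :, :] - x[:, None, :]          # diff[i, j] = x_j - x_i
    sq = (diff ** 2).sum(-1) + 1e-9
    # Attraction pulls a point toward its zone neighbors; repulsion (~1/r^2) pushes it away.
    force = (attract[:, :, None] * diff
             - repel[:, :, None] * diff / sq[:, :, None] * 0.02).sum(axis=1)
    x += lr * force
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # project back onto the sphere

# The inner product matrix x @ x.T should show a widened diagonal band plus faint
# positive/negative off-diagonal "rings", and a 3D projection of x forms a rippled curve.
print(np.round(x @ x.T, 2)[0, :12])
```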
Of particular interest is the result of setting the ambient dimension to 3:^16 the result is a curve similar to the seams of a baseball (below left, circle topology), which matches the topology observed for three intrinsically one-dimensional phenomena studied in Modell et al. 2025 and Engels et al. 2025 [7, 4] — colors by hue, dates of the year, and years of the 20th century (which also exhibit dilation). Similar ripples were predicted to occur by Olah [1] and then observed by Gorton [3] in curve detector features in InceptionV1. One of the earliest observations of ringing in a cosine similarity plot, and of a rippled spiral/helix shape in a low-dimensional embedding, was of the learned positional embeddings of tokens in GPT-2 (Yedidia 2023) [25, 26]. We also find similar structure in other representations, which we study in More Sensory and Counting Representations in the appendix.

^16 Optimization in dimension 3, unlike in higher dimensions, admits bad local minima, because a generic curve on the surface of a sphere self-intersects. To avoid this, either increase the zone width until you get a great circle and then decrease it, or do the optimization in 4D and then select 3D.
[Figure: Left curve is a locally optimal high-curvature embedding of the circle onto the 2-sphere. Right figures, reproduced with permission from Modell et al. [7], show 3-dimensional PCA projections of data or features related to colours, years, and dates.]

Sensing the Line Boundary

We now study how the character counting representations are used to determine whether the current line of text is approaching the line boundary. To detect the line boundary, the model needs to (1) determine the overall line width constraint and (2) compare the current character count with the line width to calculate the characters remaining.

Twisting with QK

We find that newline tokens have their own dedicated character counting features that activate based on the width of the line, counting the number of characters between adjacent newlines. To better understand how these representations are related, we train 150 probes, one for each possible value of "Line Width", as we did for "Character Count".

Using the attribution graph, we identify an attention head which activates boundary detection features. We visualize both sets of counting representations directly using the first 3 components of their joint PCA in the residual stream (left) and in the reduced QK space of this boundary head (right).^17

^17 Specifically, we multiply the line width probes through $W_K$ and the character count probes through $W_Q$, and plot the points in the 3D PCA basis of their joint embedding.

[Figure: Alignment between character count probes and line width probes, in the residual stream and in boundary-head QK space. Boundary heads twist the representation of line width and character count to detect the line boundary. Left: joint PCA of the character count and line width probes. Right: the same probes after multiplying through the corresponding QK weights of the boundary head. Range is from 40 (dark) to 150 (light).]
We find that this attention head "twists" the character count manifold such that character count $i$ is aligned with line width $k = i + \epsilon$. This causes the head to attend to the newline when the character count is just a bit less than the line width, thereby indicating that the boundary is approaching. This algorithm is quite general, and enables this head to detect approaching line boundaries for arbitrary line widths!^18

^18 This algorithm also generalizes to arbitrary kinds of separators (e.g., double newlines or pipes), as the QK circuit can handle the positional offset independently of the OV circuit copying the separator type.

[Figure: Cosine similarity of previous line width and character count probes through different transforms. (Left) the identity map, (Center) QK of the boundary head, (Right) QK of a random head in the same layer. Boundary heads align the probes, but with a small offset.]

This plot shows that:

* In the residual stream, probes for character count $i$ are maximally aligned with line width probes for $k$ when $i = k$, but are not highly aligned in absolute terms – the maximum cosine similarity is ~0.25.
* In the QK space of the boundary head, the probes are maximally aligned on the off-diagonal $i < k$, and are almost perfectly aligned in absolute terms – the maximum cosine similarity is $\approx 1$.
* In the QK space of a random head, there is almost no structure between the probes.

As a consequence of the ringing in the character count representations, we also observe ringing in the inner products (see Rippled Representations are Optimal above). The model is robust to these off-diagonal interference terms via the softmax applied to attention scores.
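The weights-based analysis above amounts to comparing the two probe families before and after the head's query and key maps. A sketch of that computation, assuming the probe weight matrices and the head's $W_Q$ and $W_K$ are available as arrays (file names are placeholders):

```python
import numpy as np

P_count = np.load("char_count_probes.npy")   # (150, d_model) probes for character count i
P_width = np.load("line_width_probes.npy")   # (150, d_model) probes for line width k
W_Q = np.load("boundary_head_WQ.npy")        # (d_model, d_head)
W_K = np.load("boundary_head_WK.npy")        # (d_model, d_head)

def cosine_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# (Left) alignment in the residual stream: peaks weakly on the diagonal i == k.
sim_resid = cosine_matrix(P_count, P_width)

# (Center) alignment after the head's QK maps: queries from the character count
# probes, keys from the line width probes. For a boundary head this peaks just
# off the diagonal (i slightly less than k), with near-1 cosine similarity.
sim_qk = cosine_matrix(P_count @ W_Q, P_width @ W_K)

# The per-row argmax offset estimates how far the head "twists" the manifold.
offsets = sim_qk.argmax(axis=1) - np.arange(150)
print("mean offset (characters before the boundary):", offsets.mean())
```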
Leveraging Multiple Boundary Heads

We find that the model actually uses multiple boundary heads, each twisting the manifolds by a different offset, to implement a kind of "stereoscopic" algorithm for computing the number of characters remaining.^19

^19 There are also multiple sets of boundary heads at multiple layers; they usually come in sets of ~3 with similar relative offsets (so not actually "stereo"). We attach more visualizations of boundary heads in the Appendix.

[Figure: Cosine similarity of line width and character count probes through three different boundary heads in the same layer, with different amounts of twisting. The green line indicates the argmax for each row, and is used to calculate the average offset reported in the subtitles.]

To better understand each boundary head's output, we train a set of probes for each value of characters remaining in the line (i.e., the line width $k$ minus the character count $i$, restricted to $k - i < 40$). For each boundary head, we show the proportion of attention on the newline, as well as the norm of each head's output projected onto the probe space, as a function of characters remaining. As predicted by our weights-based analysis, we observe that boundary heads have distinct but overlapping response curves that "tile" the possible values of characters remaining.

[Figure: Each boundary head's response curve peaks at a different distance from the end of the line.]

It's worth understanding why the model needs multiple boundary heads rather than just one. If the model relied only on boundary head 0, it couldn't distinguish between 5 characters remaining and 17 characters remaining — both would produce similar outputs. By having each head's output vary most significantly in different ranges, their sum achieves high resolution across the entire relevant range of "characters remaining" values. We can see this more clearly by plotting each head's output in the first two principal components of the characters remaining space (which capture 92% of the variance). Head 0 shows large variance in the [0, 10] and [15, 20] ranges, Head 1 varies most in the [10, 20] range, and Head 2 varies most in the [5, 15] range. While no single head provides high resolution across the entire curve, their sum produces an evenly spaced representation that covers all values effectively.

[Figure: Each head's output as a function of characters remaining, and their sum, in the PCA basis. Individual head outputs are almost one-dimensional, while the sum is a two-dimensional curve.]

We validate the causal importance of this two-dimensional subspace by performing an ablation and an intervention experiment. Specifically, we conduct the same experiments as before: we ablate the subspace and measure its effect on loss by token (left), and we precisely modulate the characters-remaining estimate on the last token in the aluminum prompt by substituting mean activation vectors.

[Figure: The characters remaining subspace can be causally intervened upon. (Left) Ablating the subspace has a large effect only when the next token is a newline. (Right) We surgically intervene on the characters remaining space to modulate the prediction of the newline, by subtracting the true characters-remaining mean activation and adding in a patched one. Note that the completion " aluminum." requires ten characters to fit.]
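Both intervention experiments use the same mean-substitution patch. A sketch of the arithmetic (illustrative only; applying the patched vector requires a forward-pass hook at the chosen layer, which is left abstract, and the indexing of the mean matrix is schematic):

```python
import numpy as np

# mu[v] is the mean residual-stream activation over all dataset tokens whose value
# of the patched variable (character count, or characters remaining) is v.
# (Assumes row v corresponds to value v; adjust indexing to your bookkeeping.)
mu = np.load("per_value_means.npy")       # (n_values, d_model)

def mean_substitution_patch(a: np.ndarray, original_value: int, patched_value: int) -> np.ndarray:
    """a_patched = a_original - mu[original] + mu[patched].

    Removes the component of the activation associated with the true value of the
    variable and adds in the component for the counterfactual value. Optionally,
    the swap can be restricted to a low-rank PCA subspace of the means."""
    return a - mu[original_value] + mu[patched_value]

# Example: make the model perceive 30 characters on the current line instead of 42.
a = np.load("last_token_resid.npy")       # (d_model,) activation at the chosen layer
a_patched = mean_substitution_patch(a, original_value=42, patched_value=30)
```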
The Role of the Extra Dimensions

We are now in a position to understand two distinct but related questions: (1) why these counting representations are multidimensional, and (2) why multiple attention heads are required to compute them.

Geometric Computations – A multi-dimensional representation enables the model to rotate position encodings using linear transformations — something impossible with one-dimensional representations. For instance, to detect an approaching line boundary, the model can rotate the position manifold to align with line width, then use a dot product to identify when only a few characters remain. With a 1D encoding, linear operations reduce to scaling and translation, so comparing position against line width would just multiply the two values, producing a monotonically increasing result with no natural threshold. Dimensions beyond 2D allow the manifold to pack more information through additional curvature.

Resolution – For character counting, the model must distinguish between adjacent counts over a large range of character positions, as this determines whether the next word fits. In a one-dimensional representation, positions would be arranged along a ray, with each position separated by some constant $\delta$. To reliably distinguish adjacent positions above noise, we need $\|v_{42} - v_{41}\| = \delta$ to exceed some threshold. But with 150+ positions to represent, this creates an untenable choice: either use enormous dynamic range ($\|v_{150}\| \gg \|v_1\|$), which is problematic for transformer computations, or sacrifice resolution between adjacent positions. (Normalization blocks only exacerbate this effect: while points can be spaced far apart on a ray if their norms get large enough, there is at most $\pi$ worth of angular distance along the projection of that ray onto the unit hypersphere.) Embedding the curve into higher dimensions solves this: positions maintain similar norms while being well-separated in the ambient space, achieving fine resolution without norm explosion (see Rippled Representations are Optimal above). For counting the characters remaining, the dynamic range is smaller, and so the model is able to embed that representation in a smaller subspace.
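To make the normalization point concrete, here is a schematic comparison (notation introduced here, not above) of a one-dimensional ray encoding and a circular embedding after projection to the unit sphere. On a ray, counts are scalar multiples of a fixed direction $u$ (plus an offset $b$), and normalization collapses them toward a single unit vector, so adjacent counts lose angular separation as $c$ grows:

$$ v_c = (c\,\delta)\,u + b, \qquad \hat{v}_c = \frac{v_c}{\lVert v_c \rVert} \longrightarrow u \quad (c \to \infty). $$

On a circle embedded in a 2-dimensional subspace, all counts share the same norm and adjacent counts keep a fixed angular separation regardless of $c$:

$$ w_c = r\left(\cos\tfrac{2\pi c}{N},\; \sin\tfrac{2\pi c}{N}\right), \qquad \angle(w_c, w_{c+1}) = \frac{2\pi}{N}. $$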
To achieve the curvature necessary for high resolution, multiple attention heads are needed to cooperatively construct the curved geometry of the counting manifold. An individual attention head's output is a linear combination of its inputs (weighted by attention and transformed by the OV circuit), and thus is fundamentally constrained by the curvature already present in those inputs. In the absence of MLP contributions to the counting representation, if the output manifold needs to exhibit substantial curvature, multiple attention heads need to coordinate — each contributing a piece of the overall geometric structure. We will see another example of distributed head computation in the section on the Distributed Character Counting Algorithm.

A Discovery Story

How did we originally find this boundary detection mechanism? When we first computed an attribution graph, we saw several edges from the previous newline features and embedding to predict-newline features. QK attributions showed that the top key feature was a "the previous line was 40–60 characters long" feature and the top query feature was a "the current character count is 35–50" feature. At any one time there were often multiple counting features active at different strengths, suggesting that these features might be discretizing a manifold.

[Figure: The boundary heads cause a family of boundary-detecting features to activate in response to how close the current line is to the global line width. That is, they sense the approaching line boundary, or the reverse index of the line count.]

Investigating these three sets of feature families led us to the count manifolds which they sparsely parametrize, and investigating the relevant attention heads let us find the boundary heads.

Finally, we note that these boundary-sensing representations parallel a well-studied phenomenon in neuroscience: boundary cells (Solstad et al. 2008) [27], which activate at specific distances from environmental boundaries (e.g., walls). Both the artificial features and the biological cells come in families with varied receptive fields and offsets.

Predicting the Newline

The final step of the linebreak task is to combine the estimate of the line boundary with the prediction of the next word, to determine whether the next word will fit on the line or the line should be broken. In the attribution graph for the aluminum prompt, we see exactly this merging of paths. The most influential feature^20 in the entire graph is a late feature that activates in contexts where the next word would cause the current line to exceed the overall line width.

^20 Influence in the sense of influence on the logit node, as defined in Ameisen et al. [6].
For our prompt, this feature upweights the probability of newline and downweights the probability of "aluminum." The top two inputs to this break predictor feature are a "say aluminum" feature and a "boundary detecting" feature that gets activated by the aforementioned boundary head.

While the boundary detector activates regardless of the next token length, break predictor features activate only if the next token will exceed the length of the current line (as in the aluminum prompt), and hence upweight the prediction of a newline. ^21 These features also sometimes activate on zero-width modifier tokens (e.g., a token indicating that the first letter of the following token should be capitalized) that must be adjacent to the token they modify, when that token is sufficiently long to go over the line limit (e.g., "Aluminum" instead of "aluminum"). We also see break suppressor features, which only activate if the next token would just barely fit on the line, and hence downweight the prediction of a newline. Both break predictors and suppressors come in larger feature families, which we display in the [59]Appendix.

Average activations of three features based on true next token character length and the characters remaining in a line (line width − character count).

Joint Geometry Enables Easy Computation

What is the geometry underlying the model's ability to determine if the next token will fit on the line? Put another way, how is the break predictor feature above constructed from the boundary detector and next-word features? To study this, we compute the average activations at the end of the model (~90% depth) across all tokens for all values of characters remaining $i$ and next token lengths $j$. ^22 We use the true next non-newline token as the label. This is an approximation because it assumes that the model perfectly predicts the next token. By performing a PCA on the combination of mean vectors, we see that the two counts are arranged in orthogonal subspaces with only moderate curvature. Note that this lower dimensional geometry may suffice here because the dynamic range of the count is much smaller.

[Figure: Orthogonal Representations Create a Linear Decision Boundary for Linebreaking. Panels: Next Word Length vs Characters Remaining; The Sum Makes Linebreaking Linearly Separable. Axes: next token length, characters remaining, margin after next word.] Low dimensional projections of next token character length and characters remaining counting manifolds for 1 (dark) to 15 (light) characters. (Left) The PCA of their union. (Right) The PCA of all their pairwise combinations. The orthogonal representations make the correct newline decision linearly separable.

Now consider the pairwise sum of each possible characters-remaining vector $i$ and next-token-length vector $j$. ^23 This sum is principled because both sets of vectors are marginalized data means, so collectively they have the mean of the data, which we center to be 0.
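As a toy check of why this construction makes the break decision easy, here is a minimal numpy sketch. It assumes exactly orthogonal subspaces and an arbitrary curved planar embedding of each count; the directions, embedding, and readout are illustrative, not the model's.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
# Illustrative assumption: the two counts occupy exactly orthogonal 2-D
# subspaces of a d-dimensional residual stream.
Q, _ = np.linalg.qr(rng.standard_normal((d, 4)))
U, V = Q[:, :2], Q[:, 2:]     # subspaces for characters remaining / next token length

def embed(c):
    # A gently curved planar embedding of a count in 1..15 (arbitrary choice).
    theta = np.pi * c / 15.0
    return np.array([np.cos(theta), np.sin(theta)])

# One readout direction: its projection onto the j-subspace reads out the
# monotone function h(j) = -cos(pi*j/15), and onto the i-subspace reads out
# -h(i), so w . (U@embed(i) + V@embed(j)) = h(j) - h(i).
c_dir = np.array([-1.0, 0.0])
w = V @ c_dir - U @ c_dir

correct = 0
for i in range(1, 16):            # characters remaining
    for j in range(1, 16):        # next token length
        z = U @ embed(i) + V @ embed(j)     # the pairwise sum of mean vectors
        predicted_break = (w @ z) > 1e-9    # break iff the next token does not fit (j > i)
        correct += predicted_break == (j > i)
print(f"single-hyperplane accuracy over all (i, j) pairs: {correct}/225")
```

Because the subspaces are orthogonal, a single direction reads off the difference of two monotone functions of the counts, so one hyperplane decides every pair correctly.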
As the sketch above suggests, since these counts are arranged orthogonally, the decision to break the line (determined by the sign of $i - j$) corresponds to a simple separating hyperplane. In other words, the prediction to break the line is made trivial by the underlying geometry! When we use the separating hyperplane from the PCA of these average embeddings on real data, we achieve an AUC of 0.91 on the ground truth of whether the next token should be a newline. This reflects both the error of the three-dimensional classifier and the error from Haiku's estimates of the next token.

If the length of the most likely next word is linearly represented, this scheme would allow the model to predict newlines when that word is longer than the length remaining in the line. One could imagine a more general mechanism where the model comprehensively redirects the probability mass from all words that exceed the line limit to the newline. Claude 3.5 Haiku does not seem to leverage such a mechanism: when we compare the predicted distribution of tokens at the end of a line to the distribution on an identical prompt with the newlines stripped, we find them to be quite different.

[60]A Distributed Character Counting Algorithm

Having described how the various character counting representations are used, the last big remaining question is: how are they computed? We will show how Haiku uses many attention heads across multiple layers to cooperatively compute an increasingly accurate estimate of the character count. This turned out to be the most complicated mechanism we studied, though there are many similarities with the boundary detection mechanism.

To get an intuitive understanding of the behavior of the heads important for counting, we project their outputs into the PCA space of the line character count probes. ^24 We display the average outputs over many prompts. Layer 0 heads (left) each write along what appears as a ray when visualized in the first 3 principal components; it is their sum that generates a curved manifold. Layer 1 heads (right) instead output curves which combine to produce an increasingly complex manifold. They appear responsible for sharpening the Layer 0 representation and thus the estimate of the count. We find that the $R^2$ for the character count prediction ^25 The prediction is the argmax of the head outputs projected on the character count probes. of the 5 key Layer 0 heads is 0.93, compared to 0.97 using 11 heads in the first two layers.

[Figure: Individual Head Outputs Tile the Joint Output Space. Panels: Layer 0 Heads and Layer 1 Heads (Head 0–3 and Sum of Key Heads).] Comparison of Layer 0 (left) vs Layer 1 (right) average attention outputs in the PCA basis of the character count probes from 1 (dark) to 150 (light) characters. In each layer, the outputs from each head tile the space. In Layer 0, each head output is almost 1-dimensional, while in Layer 1 heads display more curvature (which they got from Layer 0!).

Embedding Geometry

To understand how the character count is computed, we start at the very beginning: the embedding matrix. As before, we can train probes or compute the average of the embedding weights for every distinct token length.
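For concreteness, here is a minimal sketch of the "average the embedding rows by token character length, then look at the top principal components" procedure. The embedding matrix and vocabulary below are random placeholders, so only the procedure, not its output, is meaningful.

```python
import numpy as np

# Assumed stand-ins (not Haiku's actual weights): an embedding matrix W_E of
# shape (vocab_size, d_model) and the decoded string of every token.
rng = np.random.default_rng(0)
vocab_size, d_model, max_len = 10_000, 256, 14
W_E = rng.standard_normal((vocab_size, d_model))                      # placeholder weights
token_strings = ["x" * (i % max_len + 1) for i in range(vocab_size)]  # placeholder vocabulary

# Mean embedding vector for each token character length 1..14.
lengths = np.array([len(s) for s in token_strings])
means = np.stack([W_E[lengths == L].mean(axis=0) for L in range(1, max_len + 1)])

# PCA of the mean vectors. With the real embedding matrix, the top three
# components trace out the circle-plus-ripple pattern described below; with
# the random placeholder they are, of course, just noise.
centered = means - means.mean(axis=0)
_, S, Vt = np.linalg.svd(centered, full_matrices=False)
top3 = centered @ Vt[:3].T                                            # (14, 3) PC coordinates
print("variance explained by top 3 PCs:", round(float((S[:3] ** 2).sum() / (S ** 2).sum()), 3))
```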
We compute the token character count probes for character lengths 1–14 and visualize their top principal components. Using the first 3 principal components, which capture 70% of the variance, we see that embedding character counts are arranged in a circular pattern (PC1 vs PC2) with an oscillating component (PC3). This pattern is consistent with the ones observed in [61]Rippled Representations are Optimal.

PCA of embedding vectors in $W_E$ averaged by token character length.

As with all of the counting manifolds, we also find [62]features that discretize this space into overlapping notions of short, medium, and long words.

Attention Head Outputs Sum To Produce the Count

To understand the counting mechanism, we will work backwards from the summed attention outputs to the embedding. Notably, we:

* Ignore MLPs – The attention head outputs affect the character count representation 4× more than the MLPs, so we restrict our focus to attention;
* Focus on First Two Layers – Even after layer 0, counting probes have reasonable accuracy and there are coarse positional features. Therefore, we focus on how attention transforms the embeddings into the count and how layer 1 further refines this representation.

The summed output of 5 important Layer 0 heads on one prompt by token. (Left) The inner product of the summed attention outputs and the character counting probes; (Right) How the argmax of this product compares to the true line count. Context position starts at the first newline, with newlines denoted with dashes.

We can decompose the sum above into the contribution from the output of each individual head in layer 0. ^26 We omit a previous token head for visual presentation. Under this lens, we see each head performing a relatively low rank computation akin to a classification.

The individual outputs of 4 important Layer 0 heads on one prompt projected onto the character count probes.

How do individual heads implement this behavior? We can break down the behavior of an individual head by analyzing its QK circuit (where it attends) and OV circuit (the linear transformation from the embeddings to the output) * A Mathematical Framework for Transformer Circuits  [63][HTML] N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, C. Olah. Transformer Circuits Thread. 2021. [28].

QK Circuit. Each head $h$ uses the previous newline as an "attention sink," such that for some number of tokens after the newline ($s_h$), the head just attends to the newline.
After $s_h$ tokens, the head begins to smear its attention over its receptive field, which goes up to a maximum of $r_h$ tokens.

The average attention to the previous newline as a function of the token index in the line. Like boundary heads, these counting heads specialize with different positional offsets.

OV Circuit. The OV circuit coordinates with the QK circuit to create a heuristic estimate based on the number of tokens in the line multiplied by the average token length ($\mu_c \approx 4$), with an additional length correction term. When attending to the newline, each head upweights the average token length multiplied by the head's sink size: $s_h \times \mu_c$ characters. If no attention is paid to the newline, then from the perspective of the head, the current token must be at least $s_h + r_h$ tokens into the line, and the head should upweight outputs of roughly $(s_h + r_h) \times \mu_c$ characters. Finally, the OV circuit applies an additional correction depending on whether the tokens in the receptive field are above or below average in length. Below, we include a detailed walkthrough of L0H1.

The QK and OV circuit of counting head L0H1. Top left: the head output projected onto the character counting probes for 64 tokens of a single prompt (truncated at the first newline). Bottom left: the attention pattern (transposed relative to the canonical ordering). Top right: the average embedding vectors projected onto the character count probes via the OV matrix. Bottom right: a summary of the overall computation.

For a more detailed analysis of each head, see [64]The Mechanics of Head Specialization. Layer 1 attention heads perform a similar operation, but additionally leverage the initial estimate of the character count (see [65]Layer 1 Head OVs).

Computing the Line Width

To compute the line width, the model seems to use a similar distributed counting algorithm to count the characters between adjacent newlines. However, one subtlety that we do not address in this work is how the line width is actually aggregated. It is possible that the model computes a global line width by taking the max over all line lengths in the document, or that it uses an exponentially weighted moving average of the last several line lengths.
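To state these two aggregation hypotheses concretely (both are speculative, and the smoothing constant below is an arbitrary illustrative choice, not something measured from the model), a minimal sketch:

```python
def max_aggregate(line_lengths):
    # Hypothesis 1: the line width is the maximum line length seen so far.
    return max(line_lengths)

def ewma_aggregate(line_lengths, alpha=0.7):
    # Hypothesis 2: an exponentially weighted moving average over recent
    # lines; alpha is an arbitrary illustrative smoothing constant.
    estimate = line_lengths[0]
    for length in line_lengths[1:]:
        estimate = alpha * estimate + (1 - alpha) * length
    return estimate

# A document whose first lines are slightly ragged around a width of ~80.
lines = [78, 80, 74, 79, 80, 77]
print("max estimate :", max_aggregate(lines))            # 80
print("ewma estimate:", round(ewma_aggregate(lines), 1))
```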
We do note that the line width uses a partially disjoint set of heads, likely because the "attend to previous newline as a sink" mechanism needs modification when the current token is also a newline.

[66]Visual Illusions

Humans are susceptible to "visual illusions" in which contextual cues can modulate perception in seemingly unexpected ways. Famous examples include the Müller-Lyer illusion, in which arrows placed on the ends of a line can alter the perceived length of the line * The Muller-Lyer illusion explained by the statistics of image--source relationships C.Q. Howe, D. Purves. Proceedings of the National Academy of Sciences, Vol 102(4), pp. 1234--1239. National Academy of Sciences. 2005. [29]; the Ponzo and Sander illusions, which also modulate perceived line length * A review on various explanations of Ponzo-like illusions G.Y. Yildiz, I. Sperandio, C. Kettle, P.A. Chouinard. Psychonomic Bulletin & Review, Vol 29(2), pp. 293--320. Springer. 2022. [30]; and others * Space and time in visual context O. Schwartz, A. Hsu, P. Dayan. Nature Reviews Neuroscience, Vol 8(7), pp. 522--535. Nature Publishing Group UK London. 2007. [31].

Classic visual illusions in which perception of line-length is modulated.

Can we use our understanding of the character counting mechanism to construct a "visual illusion" for language models? To get started, we took the important attention heads for character counting and investigated what other roles they perform on a wider data distribution. We identified instances in which heads that normally attend from a newline to the previous newline would instead attend from a newline to the two-character string @@. This string occurs as a delimiter in git diffs, a circumstance in which you might want to start your line count at a location other than the newline:

⏎@@-14,30 +31,24 @@ export interface ClaudeCodeIAppTheme {⏎

But what happens when this sequence appears outside of a git diff context, for instance, if we insert @@ in the aluminum prompt without changing the line length?

We find that it does modulate the predicted next token, disrupting the newline prediction! As predicted, the relevant heads get distracted: whereas with the original prompt, the heads attend from newline to newline, in the altered prompt, the heads also attend to the @@.

Insertion of @@ 'distracts' an attention head which normally attends from \n back to the previous \n. (Left) Original attention pattern (truncated). (Right) Attention pattern (truncated) with @@ insertion. Now it also attends back to the @@.

How specific is this result: does any pair of letters nonsensically inserted into the prompt fully disrupt the newline prediction? We analyzed the impact of inserting (at the same two positions) 180 different two-character sequences, half of which were a repeated character. We found that while most inserted sequences moderately impact the probability of predicting a newline, newline usually remains the top prediction. There was also no clear difference between sequences consisting of the same or different characters. However, a few sequences substantially disrupted newline prediction, most of which appeared to be related to code or delimiters of some kind: `` >> }} ;| || `, @@. We further analyzed the extent to which there was a relationship between 'distraction' of the important attention heads and the impact on the newline prediction.
Indeed, we found that many of the sequences with potent modulation of newline probability, and especially code-related character pairs, also exhibited substantial modulation of attention patterns.

Insertion of most pairs of characters only moderately impacts the probability of predicting a newline. A subset of pairs, most of which appear related to code or delimiters, substantially disrupt newline prediction. The impact on newline prediction (originally 0.79) is correlated with how much the inserted tokens 'distract' character counting attention heads.

While in the aluminum prompt the task is implicit, this illusion generalizes to settings where the comparison task is made explicit. These direct comparisons are perhaps more analogous to the Ponzo, Sander, and Müller-Lyer illusions, where the perception and comparison are more direct.

These effects are robust to multiple choice orderings. Moreover, if the length of the text following the @@ exceeds that of the alternative choice, the alternative choice is selected as being shorter.

While we are not claiming any direct analogy between illusions of human visual perception and this alteration of line character count estimates, the parallels are suggestive. In both cases we see the broader phenomenon of contextual cues, and the application of learned priors about those cues, modulating estimates of the properties of entities. In the human case, priors such as three-dimensional perspective can influence perception of object size, or color constancy can influence estimates of luminance (such as in the checker shadow illusion). Here, one possible interpretation of our results is that mis-application of a learned prior, including the role of cues such as @@ in git diffs, can also modulate estimates of properties such as line length.

[67]Related Work

Objective. This work is at the intersection of LLM "biology" (making empirical observations about what is going on inside models; e.g. * On the Biology of a Large Language Model  [68][HTML] J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N.L. Turner, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T.B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, J. Batson. Transformer Circuits Thread. 2025. * A primer in BERTology: What we know about how BERT works  [69][link] A. Rogers, O. Kovaleva, A. Rumshisky. Transactions of the Association for Computational Linguistics, Vol 8, pp. 842--866. MIT Press. 2020. [70]DOI: 10.1162/tacl_a_00349 [32, 33]) and low level reverse engineering of neural networks (attempting to fully characterize an algorithm or mechanism; e.g. * Zoom In: An Introduction to Circuits  [71][link] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, S. Carter. Distill. 2020. [72]DOI: 10.23915/distill.00024.001 * Interpretability in the wild: a circuit for indirect object identification in GPT-2 small  [73][link] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, J. Steinhardt. arXiv preprint arXiv:2211.00593. 2022. * Progress measures for grokking via mechanistic interpretability  [74][link] N. Nanda, L. Chan, T. Lieberum, J. Smith, J. Steinhardt. arXiv preprint arXiv:2301.05217. 2023. * (How) Do Language Models Track State? B.Z. Li, Z.C. Guo, J. Andreas. arXiv preprint arXiv:2503.02854. 2025. [34, 35, 36, 37]).
Methodologically, our work makes heavy use of attribution graphs * Circuit Tracing: Revealing Computational Graphs in Language Models  [75][HTML] E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N.L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, J. Batson. Transformer Circuits. 2025. * Automatically identifying local and global circuits with linear computation graphs  [76][link] X. Ge, F. Zhu, W. Shu, J. Wang, Z. He, X. Qiu. arXiv preprint arXiv:2405.13868. 2024. * Transcoders find interpretable LLM feature circuits  [77][PDF] J. Dunefsky, P. Chlenski, N. Nanda. Advances in Neural Information Processing Systems, Vol 37, pp. 24375--24410. 2025. [6, 38, 39] with QK attributions * Tracing Attention Computation Through Feature Interactions  [78][HTML] H. Kamath, E. Ameisen, I. Kauvar, R. Luger, W. Gurnee, A. Pearce, S. Zimmerman, J. Batson, T. Conerly, C. Olah, J. Lindsey. Transformer Circuits Thread. 2025. [40] built on top of crosscoders * Sparse Crosscoders for Cross-Layer Features and Model Diffing  [79][HTML] J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, C. Olah. 2024. [18].

Linebreaking. Michaud et al. * The Quantization Model of Neural Scaling  [80][link] E.J. Michaud, Z. Liu, U. Girit, M. Tegmark. Thirty-seventh Conference on Neural Information Processing Systems. 2023. [5] identified linebreaking in fixed-width text as one of the top 400 "quanta" of model behavior in the smallest model (70M parameters) in the Pythia suite.

Position. Prior interpretability work on positional mechanisms has largely focused on token position (e.g., * GPT-2's positional embedding matrix is a helix  [81][link] A. Yedidia. 2023. * The positional embedding matrix and previous-token heads: how do they actually work?  [82][link] A. Yedidia. Alignment Forum. 2023. * Neurons in large language models: Dead, n-gram, positional E. Voita, J. Ferrando, C. Nalmpantis. arXiv preprint arXiv:2309.04827. 2023. * Understanding positional features in layer 0 SAEs  [83][link] B. Chughtai, Y. Lau. 2024. * Universal neurons in GPT2 language models  [84][link] W. Gurnee, T. Horsley, Z.C. Guo, T.R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, D. Bertsimas. arXiv preprint arXiv:2401.12181. 2024. [25, 26, 41, 42, 43]). These works have shown that there exist MLP neurons * Neurons in large language models: Dead, n-gram, positional E. Voita, J. Ferrando, C. Nalmpantis. arXiv preprint arXiv:2309.04827. 2023. * Universal neurons in GPT2 language models  [85][link] W. Gurnee, T. Horsley, Z.C. Guo, T.R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, D. Bertsimas. arXiv preprint arXiv:2401.12181. 2024. [41, 43], SAE features * Understanding positional features in layer 0 SAEs  [86][link] B. Chughtai, Y. Lau. 2024. [42], and learned position embeddings * GPT-2's positional embedding matrix is a helix  [87][link] A. Yedidia. 2023. [25] with periodic structure encoding absolute token position. Our work illustrates how a model might also want to construct non-token-based position schemes that are more natural for many downstream prediction tasks. Others have also studied, even going back to LSTMs, the existence of mechanisms in language models for controlling the length of output responses * Why neural translations are the right length X. Shi, K. Knight, D. Yuret.
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2278--2282. 2016. * Length Representations in Large Language Models S. Moon, D. Choi, J. Kwon, H. Kamigaito, M. Okumura. arXiv preprint arXiv:2507.20398. 2025. [44, 45] , as well as performed more theoretical analyses of the space of counting algorithms * LSTM networks can perform dynamic counting M. Suzgun, S. Gehrmann, Y. Belinkov, S.M. Shieber. arXiv preprint arXiv:1906.03648. 2019. * Language models need inductive biases to count inductively Y. Chang, Y. Bisk. arXiv preprint arXiv:2405.20131. 2024. [46, 47] . Geometry and Feature Manifolds. Beyond position, there has been extensive work in understanding the geometric representation of numbers, especially in toy models (e.g., * Progress measures for grokking via mechanistic interpretability  [88][link] N. Nanda, L. Chan, T. Lieberum, J. Smith, J. Steinhardt. arXiv preprint arXiv:2301.05217. 2023. * The clock and the pizza: Two stories in mechanistic explanation of neural networks  [89][PDF] Z. Zhong, Z. Liu, M. Tegmark, J. Andreas. Advances in neural information processing systems, Vol 36, pp. 27223--27250. 2023. * Feature emergence via margin maximization: case studies in algebraic tasks D. Morwani, B.L. Edelman, C. Oncescu, R. Zhao, S. Kakade. arXiv preprint arXiv:2311.07568. 2023. [36, 48, 49] ) and in the context of arithmetic in LLMs (e.g., * A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis  [90][link] A. Stolfo, Y. Belinkov, M. Sachan. arXiv preprint arXiv:2305.15054. 2023. * Pre-trained large language models use fourier features to compute addition  [91][link] T. Zhou, D. Fu, V. Sharan, R. Jia. arXiv preprint arXiv:2406.03445. 2024. * Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics  [92][link] Y. Nikankin, A. Reusch, A. Mueller, Y. Belinkov. 2024. * Language Models Use Trigonometry to Do Addition  [93][link] S. Kantamneni, M. Tegmark. 2025. * Understanding In-context Learning of Addition via Activation Subspaces X. Hu, K. Yin, M.I. Jordan, J. Steinhardt, L. Chen. arXiv preprint arXiv:2505.05145. 2025. [50, 51, 52, 53, 54] ). Collectively, these works have shown that both real LLMs and toy transformers learn periodic representations * Pre-trained large language models use fourier features to compute addition  [94][link] T. Zhou, D. Fu, V. Sharan, R. Jia. arXiv preprint arXiv:2406.03445. 2024. * Language Models Use Trigonometry to Do Addition  [95][link] S. Kantamneni, M. Tegmark. 2025. * Understanding In-context Learning of Addition via Activation Subspaces X. Hu, K. Yin, M.I. Jordan, J. Steinhardt, L. Chen. arXiv preprint arXiv:2505.05145. 2025. [51, 53, 54] , with numbers arranged in a helix to enable certain matrix multiplication based addition algorithms * Progress measures for grokking via mechanistic interpretability  [96][link] N. Nanda, L. Chan, T. Lieberum, J. Smith, J. Steinhardt. arXiv preprint arXiv:2301.05217. 2023. * Language Models Use Trigonometry to Do Addition  [97][link] S. Kantamneni, M. Tegmark. 2025. [36, 53] , and that these representations are provably optimal in certain settings * Feature emergence via margin maximization: case studies in algebraic tasks D. Morwani, B.L. Edelman, C. Oncescu, R. Zhao, S. Kakade. arXiv preprint arXiv:2311.07568. 2023. [49] . In our context, we similarly observe helical representations * Language Models Use Trigonometry to Do Addition  [98][link] S. Kantamneni, M. Tegmark. 2025. 
[53] , numeric dilation * Number Representations in LLMs: A Computational Parallel to Human Perception H. AlquBoj, H. AlQuabeh, V. Bojkovic, T. Hiraoka, A.O. El-Shangiti, M. Nwadike, K. Inui. arXiv preprint arXiv:2502.16147. 2025. [55] , and distributed algorithms across components that collectively implement a correct computation * How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model  [99][PDF] M. Hanna, O. Liu, A. Variengien. Advances in Neural Information Processing Systems, Vol 36, pp. 76033--76060. 2023. * Understanding In-context Learning of Addition via Activation Subspaces X. Hu, K. Yin, M.I. Jordan, J. Steinhardt, L. Chen. arXiv preprint arXiv:2505.05145. 2025. [56, 54] . Multidimensional features with clear geometric structure have been found in more natural contexts * Successor Heads: Recurring, Interpretable Attention Heads In The Wild  [100][link] R. Gould, E. Ong, G. Ogden, A. Conmy. 2023. * Not All Language Model Features Are One-Dimensionally Linear  [101][link] J. Engels, E.J. Michaud, I. Liao, W. Gurnee, M. Tegmark. The Thirteenth International Conference on Learning Representations. 2025. * The Origins of Representation Manifolds in Large Language Models A. Modell, P. Rubin-Delanchy, N. Whiteley. arXiv preprint arXiv:2505.18235. 2025. [57, 4, 7] , like in the representation and computation of certain ordinal relationships (e.g., months of the year). In vision models, curve detector neurons * Curve Detectors  [102][link] N. Cammarata, G. Goh, S. Carter, L. Schubert, M. Petrov, C. Olah. Distill. 2020. [58] and features * The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision  [103][link] L. Gorton. arXiv preprint arXiv:2406.03662. 2024. [20] have been especially well studied and closely resemble the kind of discretization we observe with the families of character counting features. Many other topics have received interpretability analysis of the underlying geometry, such as grammatical relations * A structural probe for finding syntax in word representations  [104][PDF] J. Hewitt, C.D. Manning. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129--4138. 2019. [105]DOI: 10.18653/v1/N19-1419 * Visualizing and measuring the geometry of BERT  [106][PDF] A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, M. Wattenberg. Advances in Neural Information Processing Systems, Vol 32. 2019. [10, 11] , multilingual representations * The geometry of multilingual language model representations T.A. Chang, Z. Tu, B.K. Bergen. arXiv preprint arXiv:2205.10964. 2022. [12] , truth * The geometry of truth: Emergent linear structure in large language model representations of true/false datasets  [107][link] S. Marks, M. Tegmark. arXiv preprint arXiv:2310.06824. 2023. [59] , binding * How do language models bind entities in context?  [108][link] J. Feng, J. Steinhardt. arXiv preprint arXiv:2310.17191. 2023. [60] , refusal * The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence  [109][link] T. Wollschlager, J. Elstner, S. Geisler, V. Cohen-Addad, S. Gunnemann, J. Gasteiger. arXiv preprint arXiv:2502.17420. 2025. [16] , features * Projecting assumptions: The duality between sparse autoencoders and concept geometry S.S.R. Hindupur, E.S. Lubana, T. Fel, D. Ba. arXiv preprint arXiv:2503.01822. 2025. 
* The geometry of concepts: Sparse autoencoder feature structure Y. Li, E.J. Michaud, D.D. Baek, J. Engels, X. Sun, M. Tegmark. Entropy, Vol 27(4), pp. 344. MDPI. 2025. [17, 15] , and hierarchy * The geometry of categorical and hierarchical concepts in large language models K. Park, Y.J. Choe, Y. Jiang, V. Veitch. arXiv preprint arXiv:2406.01506. 2024. [14] , though more conceptual research is needed * Relational composition in neural networks: A survey and call to action M. Wattenberg, F.B. Viegas. arXiv preprint arXiv:2407.14662. 2024. [13] . Perhaps most relevant is recent work from Modell et al. * The Origins of Representation Manifolds in Large Language Models A. Modell, P. Rubin-Delanchy, N. Whiteley. arXiv preprint arXiv:2505.18235. 2025. [7] , who provide a more formal notion of a feature manifold, and propose that cosine similarity encodes the intrinsic geometry of features. When testing their theory, they observe highly structured and interpretable data manifolds that have ripples and dilation, similar to our counting manifolds. These observations raise a methodological challenge in how to best capture data with different structure (see e.g. * Projecting assumptions: The duality between sparse autoencoders and concept geometry S.S.R. Hindupur, E.S. Lubana, T. Fel, D. Ba. arXiv preprint arXiv:2503.01822. 2025. * Understanding sparse autoencoder scaling in the presence of feature manifolds  [110][PDF] E.J. Michaud, L. Gorton, T. McGrath. 2025. * Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning X. Huang, M. Hahn. arXiv preprint arXiv:2508.01916. 2025. [17, 61, 62] ), but also the exciting hypothesis that many naturally continuous variables (e.g., * Monotonic representation of numeric properties in language models B. Heinzerling, K. Inui. arXiv preprint arXiv:2403.10381. 2024. * Language Models Represent Space and Time  [111][link] W. Gurnee, M. Tegmark. 2024. [63, 64] ) exist in more organized manifolds. Biological Analogues. The geometric and algorithmic patterns we observe have suggestive parallels to perception in biological neural systems. Our character count features are analogous to place cells on a 1-D track * Place cells, grid cells, and the brain's spatial representation system.  [112][link] E.I. Moser, E. Kropff, M. Moser. Annual review of neuroscience, Vol 31, pp. 69-89. 2008. [21] and our boundary detecting features are analogous to boundary cells * Representation of geometric borders in the entorhinal cortex T. Solstad, C.N. Boccara, E. Kropff, M. Moser, E.I. Moser. Science, Vol 322(5909), pp. 1865--1868. American Association for the Advancement of Science. 2008. [27] . These features exhibit dilation—representing increasingly large character counts activating over increasingly large ranges—mirroring the dilation of number representations in biological brains * The neural basis of the Weber--Fechner law: a logarithmic mental number line S. Dehaene. Trends in cognitive sciences, Vol 7(4), pp. 145--147. Elsevier. 2003. * Tuning curves for approximate numerosity in the human intraparietal sulcus M. Piazza, V. Izard, P. Pinel, D. Le Bihan, S. Dehaene. Neuron, Vol 44(3), pp. 547--555. Elsevier. 2004. [22, 23] . Moreover, the organization of the features on a low dimensional manifold is an instance of a common motif in biological cognition (e.g., * A neural manifold view of the brain M.G. Perich, D. Narain, J.A. Gallego. Nature Neuroscience, pp. 1--16. Nature Publishing Group US New York. 2025. [65] ). 
While the analogies are not perfect, we suspect that there is still fruitful conceptual overlap, and much to be gained from increased collaboration between neuroscience and interpretability * Position: An inner interpretability framework for AI inspired by lessons from cognitive neuroscience M.G. Vilas, F. Adolfi, D. Poeppel, G. Roig. arXiv preprint arXiv:2406.01352. 2024. * Multilevel interpretability of artificial neural networks: leveraging framework and methods from neuroscience Z. He, J. Achterberg, K. Collins, K. Nejad, D. Akarca, Y. Yang, W. Gurnee, I. Sucholutsky, Y. Tang, R. Ianov, others. arXiv preprint arXiv:2408.12664. 2024. * Cognitively Inspired Interpretability in Large Neural Networks A. Leshinskaya, T. Webb, E. Pavlick, J. Feng, G. Opielka, C. Stevenson, I.A. Blank. Proceedings of the Annual Meeting of the Cognitive Science Society, Vol 47. 2025. [66, 67, 68].

[113]Discussion

In this paper, we studied the steps involved in a large model performing a naturalistic behavior. The linebreaking task, frequently encountered in training, requires the model to represent and compute a number of scalar quantities involving position in character count units that are not explicit in its input or output ^27 Tokens do not come annotated with character counts, and there are no vertical bars on the page showing the line width. and then to integrate those values with the outputs of complex semantic circuits (that predict the next proper word) to predict the next token. We found sparse features corresponding to each important step of the computation, and for those steps involving scalar quantities, we were able to find a geometric description that significantly simplified the interpretation of the algorithm used by the model. We now reflect on what we learned from that process:

Naturalistic Behavior and Sensory Processing. Deep mechanistic case studies benefit from choosing behaviors that the model performs consistently well, as these are more likely to have crisper mechanisms. This means prioritizing tasks that are natural in pretraining over tasks that seem natural to human investigators, and ideally, tasks that are easily supervisable. As in biological neuroscience, perceptual tasks are often both natural and easy to supervise for interpretability (e.g., it is easy to modify the input in a programmatic way). Although we sometimes describe the early layers of language models as responsible for "detokenizing" the input * Softmax Linear Units  [114][HTML] N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandlish, D. Amodei, C. Olah. Transformer Circuits Thread. 2022. * Finding Neurons in a Haystack: Case Studies with Sparse Probing  [115][link] W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, D. Bertsimas. arXiv preprint arXiv:2305.01610. 2023. * Information flow routes: Automatically interpreting language models at scale J. Ferrando, E. Voita. arXiv preprint arXiv:2403.00824. 2024. * The remarkable robustness of LLMs: Stages of inference? V. Lad, J.H. Lee, W. Gurnee, M. Tegmark. arXiv preprint arXiv:2406.19384. 2024. [69, 70, 71, 72], it is perhaps more evocative to think of this as perception.
The beginning of the model is really responsible for seeing the input, and much of the early circuitry is in service of sensing or perceiving the text, similar to how early layers in vision models implement low level perception * Zoom In: An Introduction to Circuits  [116][link] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, S. Carter. Distill. 2020. [117]DOI: 10.23915/distill.00024.001 * Beyond the doors of perception: Vision transformers represent relations between objects M. Lepori, A. Tartaglini, W.K. Vong, T. Serre, B.M. Lake, E. Pavlick. Advances in Neural Information Processing Systems, Vol 37, pp. 131503--131544. 2024. [34, 73].

The Utility of Geometry. Many of the representations and computations we studied had elegant geometric interpretations. For example, the counting manifolds are the result of an optimal tradeoff between capacity and resolution, with deep connections to space-filling curves and Fourier features. The boundary head twist was especially beautiful, and after discovering one such head, we were able to correctly predict that there would need to be additional heads to provide curvature in the output. The distributed character counting algorithm was more complex, but we were still able to clarify our view by studying linear actions on these manifolds. For other computations, like the final breaking decision, the linear separation was clearly part of the story, but there must be some additional complexity, which we were not yet able to see, that handles multitoken outputs. For the more semantic operations, we relied purely on the feature view.

Of course, describing any behavior in full is immensely complicated, and there is a long list of possible subtleties we did not study: how the model accounts for uncertainty in its counting, its mechanism for estimating the line width given multiple prior lines of text, how it adapts to documents with variable line width, how it handles multiple plausible output tokens of different lengths or multitoken words, or various special cases (e.g., a LaTeX \footnote{} or a markdown link). For the inspired, we share transcoder attribution graphs for a fixed-width line break prompt on [118]Gemma 2 2B and [119]Qwen 3 4B, using the new Neuronpedia interactive interface.

Unsupervised Discovery. It likely would not have been possible to develop this clarity if it were not for the unsupervised sparse features. In fact, when we started this project, we attempted to just probe and patch our way to understanding, but this turned out poorly. Specifically, we did not understand what we were looking for (e.g., we didn't know to distinguish line width vs. character count), where to look for it (e.g., we didn't expect line width to only be represented on the newline), or how to look for it (we started by training 1-D linear regression probes). However, even after identifying some relevant features, but before spending substantial effort systematically characterizing their activity profiles, we were still confused by what they were representing. We saw dozens of features that were vaguely about newlines and linebroken text, but their differences were not obvious from flipping through the activating examples. Only after we tested these features on synthetic datasets did their role in the graph and the underlying computation become clear. We suspect better automatic labels * Language models can explain neurons in language models  [120][HTML] S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, W. Saunders. 2023.
* Automatically interpreting millions of features in large language models G. Paulo, A. Mallen, C. Juang, N. Belrose. arXiv preprint arXiv:2410.13928. 2024. * Enhancing automated interpretability with output-centric feature descriptions Y. Gur-Arieh, R. Mayan, C. Agassy, A. Geiger, M. Geva. arXiv preprint arXiv:2501.08319. 2025. [74, 75, 76] enhanced with agentic workflows * A multimodal automated interpretability agent T.R. Shaham, S. Schwettmann, F. Wang, A. Rajaram, E. Hernandez, J. Andreas, A. Torralba. Forty-first International Conference on Machine Learning. 2024. * Building and evaluating alignment auditing agents T. Bricken, R. Wang, S. Bowman, E. Ong, J. Treutlein, J. Wu, E. Hubinger, S. Marks. 2025. [77, 78] would accelerate this work, especially in less verifiable domains.

Feature-Manifold Duality. The discrete feature and geometric feature-manifold perspectives offer dual lenses on the same underlying object. For example, in this work the model's representation of character count can be completely described (modulo reconstruction error) by the activities of the features we identified, where the action of the boundary heads is described by virtual weights that expand out the feature interactions via attention head matrices. The same character count representation can be described by a 1-dimensional feature manifold – a curve in the residual stream parametrized by the character count variable – where the linear action of the boundary heads is described by continuous "twisting" of the manifold. In general, geometric structures learned by the model will likely admit both global parametrizations and local discrete approximations.

The Complexity Tax. Despite this duality, the descriptions produced by the two perspectives differ in their simplicity. The discrete features shatter the model into many pieces, producing a complex understanding of the computation. This seems like a general lesson: discrete features and attribution graphs may provide a true description of model computation, which can be found in an automated way using dictionary learning. Getting any true, understandable description of the computation is a very non-trivial victory! However, if we stop there, and don't understand additional structure which is present, we pay a complexity tax, where we understand things in a needlessly complicated way. In the line breaking problem, constructing the manifold paid down this tax, but one could imagine other ways of reducing the interpretation burden.

A Call for Methodology. Armed with our feature understanding, we were able to directly search for the relevant geometric structures. This was an existence proof more than a general recipe, and we need methods that can automatically surface simpler structures to pay down the complexity tax. In our setting, this meant studying feature manifolds, and it would be nice to see unsupervised approaches to detecting them. In other cases we will need yet other tools to reduce the interpretation burden, like finding hierarchical representations * From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit V. Costa, T. Fel, E.S. Lubana, B. Tolooshams, D. Ba. arXiv preprint arXiv:2506.03093. 2025. [8] or macroscopic structure * Interpretability Dreams  [121][HTML] C. Olah. 2023. [9] in the global weights * Circuit Tracing: Revealing Computational Graphs in Language Models  [122][HTML] E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N.L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M.
Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, J. Batson. Transformer Circuits. 2025. [6].

A Call for Biology. The model must perform other elegant computations. We can find these by starting with a specific task the model performs well, studying it from multiple perspectives, developing methodology to answer the remaining questions, and relentlessly attempting to simplify our explanations. Because the investigation is grounded in specific examples of a behavior, it provides a fast feedback loop, can shed light on weaknesses of existing methods and inspire new ones, and can sharpen our conceptual language for understanding neural networks. We would be excited to see more deep case studies that adopt this approach.

[123]Citation Information

For attribution in academic contexts, please cite this work as

Gurnee, et al., "When Models Manipulate Manifolds: The Geometry of a Counting Task", Transformer Circuits, 2025.

BibTeX citation

@article{gurnee2025when,
  author={Gurnee, Wes and Ameisen, Emmanuel and Kauvar, Isaac and Tarng, Julius and Pearce, Adam and Olah, Chris and Batson, Joshua},
  title={When Models Manipulate Manifolds: The Geometry of a Counting Task},
  journal={Transformer Circuits Thread},
  year={2025},
  url={https://transformer-circuits.pub/2025/linebreaks/index.html}
}

[124]Acknowledgments

We would like to thank the following people who reviewed an early version of the manuscript and provided helpful feedback that we used to improve the final version: Owen Lewis, Tom McGrath, Eric Michaud, Alexander Modell, Patrick Rubin-Delanchy, Nicholas Sofroniew, and Martin Wattenberg. We are also thankful to all the members of the interpretability team for their helpful discussion and feedback, especially Doug Finkbeiner for discussions of rippling and ringing, Jack Lindsey on framing, Tom Henighan for feedback on clarity, Brian Chen for improving the design of the figures and line edits of the text, and the team who built the attribution graph * Circuit Tracing: Revealing Computational Graphs in Language Models  [125][HTML] E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N.L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, J. Batson. Transformer Circuits. 2025. [6] and QK attribution infrastructure * Tracing Attention Computation Through Feature Interactions  [126][HTML] H. Kamath, E. Ameisen, I. Kauvar, R. Luger, W. Gurnee, A. Pearce, S. Zimmerman, J. Batson, T. Conerly, C. Olah, J. Lindsey. Transformer Circuits Thread. 2025. [40].

[127]Haiku Task Performance

Haiku is able to adapt to the line length for every value of $k$, predicting newlines at the correct positions with high probability by the third line. Of course, some error is to be expected even with a perfect estimate of line length, as the model may incorrectly predict the next semantic token.
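A minimal sketch of how such a synthetically wrapped evaluation can be constructed (the wrapping procedure and sample text are our own illustrative choices, not necessarily the authors' exact setup):

```python
import textwrap

def wrap_to_width(text, k):
    # Greedy word wrap at a fixed character width k; one plausible way to
    # produce fixed-width prose for the evaluation.
    return "\n".join(textwrap.wrap(text, width=k))

sample = ("The chemical symbol for aluminum is Al and it is the most "
          "abundant metal in the Earth's crust by mass.")
for k in (20, 40, 60):
    wrapped = wrap_to_width(sample, k)
    line_lengths = [len(line) for line in wrapped.split("\n")]
    # The evaluation asks the model to place "\n" exactly where the wrapper did.
    print(f"k={k:3d}  line lengths: {line_lengths}")
```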
Below is the mean log-prob and accuracy for newline prediction of Haiku on 200 prose sequences that were synthetically wrapped to have lines of character length $k$, for $k = 20, 40, \ldots, 140$.

[128]Feature Splitting and Universality

It is natural to ask if the character counting features are fundamental, or simply one discretization of the space among many. We found that dictionaries of different sizes learn features with very similar receptive fields, so this featurization – including the slowly dilating widths – is in some sense canonical. We hypothesize that this canonical structure emerges from a boundary constraint: positions near zero (start of line) create a natural anchoring point for feature development.

The geometry of the decoder directions is also fairly consistent between the dictionaries, showing characteristic ringing.

However, we do see some evidence of feature splitting. For example, below are three character count features which activate on the same interval (~20–45 characters in the line), but differentially activate for lines of different widths: LCC2.a activates on all line widths, LCC2.b preferentially activates on long line widths, and LCC2.c preferentially activates when close to the line width boundary.

Recent work has raised the possibility that feature dictionaries could behave pathologically in the presence of feature manifolds * Understanding sparse autoencoder scaling in the presence of feature manifolds  [129][PDF] E.J. Michaud, L. Gorton, T. McGrath. 2025. [61], because a dictionary could allocate an increasing number of features in a finer tiling of the space. However, our observation that crosscoders of varying size tile this feature manifold in a canonical way suggests that this behavior does not occur in this setting.

[130]Line Width Features

Line width features tile the space similarly to character count features.
[131]Dynamical System Model

We simulate $N = 100$ points on the unit $(n-1)$-sphere in $\mathbb{R}^n$ ($n \in \{3,\ldots,8\}$) with pairwise forces:

$$\mathbf{F}_{ij} = \begin{cases} \dfrac{1 - (d_{ij} - 1)/2}{r_{ij}} \,\hat{\mathbf{r}}_{ij} & \text{when } d_{ij} \leq w \\[4pt] -\dfrac{\min(5,\, 1/r_{ij})}{r_{ij}} \,\hat{\mathbf{r}}_{ij} & \text{when } d_{ij} > w \end{cases}$$

where $r_{ij} = \|\mathbf{x}_j - \mathbf{x}_i\|$, $\hat{\mathbf{r}}_{ij} = (\mathbf{x}_j - \mathbf{x}_i)/r_{ij}$, $w$ is the attractive zone width parameter, and $d_{ij} = \min(|j-i|, |j-i+N|, |j-i-N|)$ is the index distance (for the circular topology; for the interval it is just $d_{ij} = |j-i|$).
Evolution follows $\dot{\mathbf{v}}_i = \sum_{j \neq i} \mathbf{F}_{ij} - 0.05\,\mathbf{v}_i$ and $\dot{\mathbf{x}}_i = \mathbf{v}_i$, with the sphere constraint $\mathbf{x}_i \leftarrow \mathbf{x}_i/\|\mathbf{x}_i\|$ enforced after each timestep ($\Delta t = 0.01$, damping $\alpha = 0.95$).
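A minimal numpy sketch of this simulation (we fix $n = 4$, pick an attractive zone width $w = 3$, a step count, and a small numerical epsilon ourselves; the update below is one straightforward discretization of the dynamics above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, w = 100, 4, 3               # w (attractive zone width) is an illustrative choice
dt, alpha, steps = 0.01, 0.95, 3000

x = rng.standard_normal((N, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # start on the unit sphere
v = np.zeros((N, n))

idx = np.arange(N)
d = np.abs(idx[:, None] - idx[None, :])
d = np.minimum(d, N - d)                             # circular index distance d_ij

for _ in range(steps):
    diff = x[None, :, :] - x[:, None, :]             # x_j - x_i
    r = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(r, np.inf)                      # no self-interaction
    r = np.maximum(r, 1e-6)                          # numerical safeguard (our addition)
    rhat = diff / r[..., None]
    attract = (1.0 - (d - 1.0) / 2.0) / r            # branch for d_ij <= w
    repel = -np.minimum(5.0, 1.0 / r) / r            # branch for d_ij >  w
    mag = np.where(d <= w, attract, repel)
    force = (mag[..., None] * rhat).sum(axis=1)      # sum over j of F_ij
    v = alpha * (v + dt * (force - 0.05 * v))        # damped velocity update
    x = x + dt * v
    x /= np.linalg.norm(x, axis=1, keepdims=True)    # re-project onto the sphere

# Inspect the resulting geometry: neighbours on the index circle vs. far indices.
cos = x @ x.T
print("mean cosine, index offset 1 :", round(float(np.diagonal(cos, 1).mean()), 3))
print("mean cosine, index offset 25:", round(float(np.diagonal(cos, 25).mean()), 3))
```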
The columns of the square root of $X$ are $n$ vectors $v_1,\ldots,v_n \in \mathbb{R}^n$, where $n$ is the number of discrete points on the circle, whose inner products reproduce the similarity matrix. ^28 The entire continuous circle embeds into the infinite-dimensional Hilbert space $L^2(\mathbb{S}^1)$ via this construction. Now suppose we would like to find vectors in a lower-dimensional space whose similarity matrix approximates $X$, like the model does for character counts. The best $k$-dimensional approximation of this similarity, in an $L^2$ sense, is given by taking the eigendecomposition of $X$ and truncating it to the top $k$ eigenvectors; the square root of the result provides $n$ vectors in $k$ dimensions with the corresponding similarity pattern. If $\pi_k$ is the projector onto the span of the top $k$ eigenvectors, then the images $\pi_k v_i$, which live in a $k$-dimensional subspace, are precisely those vectors. We can see below that the resulting low-rank matrix has ringing ([133]colab notebook).
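This construction is easy to reproduce numerically. The sketch below is our own illustration (not the linked notebook); the bump width, $n = 150$, and $k = 9$ are arbitrary choices. It builds the circulant similarity matrix of a narrow triangular bump, truncates its eigendecomposition to the top $k$ eigenvectors, and inspects a row of the resulting low-rank similarity matrix, which exhibits negative side lobes, i.e. ringing.

```python
import numpy as np

n, k = 150, 9                        # points on the circle; k chosen odd so cosine/sine pairs stay together

# Narrow-peaked bump f on the discrete circle (triangular, half-width 8 -- an arbitrary choice)
idx = np.arange(n)
dist = np.minimum(idx, n - idx)      # circular distance from index 0
f = np.clip(1 - dist / 8.0, 0, None)

# Circulant cosine-similarity matrix: X[i, j] = f((j - i) mod n)
X = np.array([np.roll(f, i) for i in range(n)])

# Best rank-k approximation via eigendecomposition (X is symmetric; clip tiny negative eigenvalues)
evals, evecs = np.linalg.eigh(X)
top = np.argsort(evals)[::-1][:k]
V = evecs[:, top] * np.sqrt(np.clip(evals[top], 0, None))  # rows: k-dim coordinates of pi_k v_i
X_k = V @ V.T                        # low-rank similarity matrix

row = X_k[n // 2]                    # a cross-section: note the negative side lobes (ringing)
print(row.min(), row.max())
```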
Finally, because the [134]discrete Fourier transform diagonalizes circulant matrices, the Fourier coefficients of $f$ are in fact the eigenvalues of $X$; the low-rank approximation consists of truncating the small Fourier coefficients of $f$; and the resulting rows of $X$ exhibit ringing.

One essential feature of the representation of line character counts is that the "boundary head" twists the representation, enabling each count to pair with a count slightly larger, indicating that the boundary is close. That is, there is a linear map QK which slides the character count curve along itself. Such an action is not admitted by generic high-curvature embeddings of the circle or the interval, like the ones in the physical model we constructed. But it is present both in the manifold we observe in Haiku and, as we now show, in the Fourier construction.

First, note that permuting the coordinates of $\mathbb{R}^n$ by taking $e_i \mapsto e_{i+1}$ has the effect of mapping $v_i \mapsto v_{i+1}$. That is, the (linear) action of this permutation $\rho$ on $\mathbb{R}^n$ acts by a rotation of the embedded circle with respect to its intrinsic geometry. Because conjugation by $\rho$ fixes the circulant matrix $X$, it respects its eigendecomposition, and thus commutes with the projection $\pi_k$ onto the vector space spanned by its top $k$ eigenvectors.
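This commuting property, and the rotation of the projected points derived in the next paragraph, can be checked numerically. The sketch below repeats the same construction as above (same arbitrary choices of bump width, $n$, and $k$; again our illustration, not the authors' code): the cyclic shift commutes with the projector, and its restriction sends $\pi_k v_i$ to $\pi_k v_{i+1}$, up to floating-point error, provided $k$ does not split a degenerate cosine/sine eigenvalue pair.

```python
import numpy as np

n, k = 150, 9
idx = np.arange(n)
dist = np.minimum(idx, n - idx)
f = np.clip(1 - dist / 8.0, 0, None)
X = np.array([np.roll(f, i) for i in range(n)])            # circulant similarity matrix

evals, evecs = np.linalg.eigh(X)
top = np.argsort(evals)[::-1][:k]
U = evecs[:, top]
P = U @ U.T                                                # projector pi_k onto the top-k eigenspace
S = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T   # X^(1/2); column i is v_i

R = np.roll(np.eye(n), 1, axis=0)                          # cyclic shift rho: e_i -> e_{i+1}

# rho commutes with pi_k because both are functions of the circulant matrix X
print("commutator norm:", np.abs(R @ P - P @ R).max())

# the restriction pi_k o rho o pi_k maps pi_k v_i to pi_k v_{i+1}
lhs = P @ R @ P @ S                                        # column i: (pi_k rho pi_k)(pi_k v_i)
rhs = np.roll(P @ S, -1, axis=1)                           # column i: pi_k v_{i+1}
print("rotation error:", np.abs(lhs - rhs).max())          # ~0 up to floating-point error
```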
The restriction of $\rho$ to that subspace, $\overline{\rho} := \pi_k \circ \rho \circ \pi_k$, acts by rotation on the lower-dimensional vectors: $\overline{\rho}: \pi_k v_i \mapsto \pi_k v_{i+1}$. Thus we have found $n$ vectors in a $k$-dimensional space, whose similarity is as close as possible to $X$ (and has ringing), together with a linear map on $\mathbb{R}^k$ that rotates the vectors along a rippled embedded circle.

We evaluate whether a Fourier decomposition of the character count curve is optimal, and find that it is quite close, given that it does not account for dilation. Fourier components explain at most 10% less variance than an equivalent number of PCA components, which are optimal for capturing variance.

Finally, we note that as one moves through layers, the representation becomes more peaked. This sharpening of the receptive field helps the model better estimate character counts, and corresponds to higher curvature in the embedding and, as predicted by the model above, more pronounced ringing. Below we show cross-sections (at character counts 30, 60, 90, 120) of the cosine similarity matrix of probes trained after layers 0, 1, 2, and 3. With each subsequent layer, the graphs become more tightly peaked and the secondary rings grow higher.

[135]Geometry of Twisting

Different heads access and manipulate the space in different ways. Below, we show the cosine similarity of both probe sets through QK for three heads: one which keeps them aligned, one which shifts character count to align better with later line widths, and one which does the opposite.

We can also look at this transformation by visualizing the singular value decomposition of each set of probes in a joint basis after passing them through QK. Once more, the alignment, left offset, and right offset can be read directly from the components.
We can directly plot the first 3 components of the joint probe space after passing them through each QK. Doing so shows that one head keeps the representations aligned, while the others twist them either clockwise or counterclockwise.

[136]Break Predictor Features

Boundary detector features (at roughly ⅓ of model depth) do not take into account the length of the next token.

Later in the model, there exist features which incorporate both the number of characters remaining and the length of the most likely next token. These features only activate when the most likely next token is longer than the number of characters remaining (i.e., below the red diagonal below), as is the case in our aluminum prompt.

We also found features for the converse: features which suppress the newline because the predicted next token is shorter than the number of characters remaining in the line.

Both break prediction and suppression features sometimes also have interpretable logit effects on the output of all tokens, not just the newline. For instance, the features below respectively excite and suppress the newline as their top effect, but also systematically suppress tokens with more characters. This is because if the model is wrong about the value of the next token (and whether it is a newline), the actual next token must at least be short enough to fit on the line.

[137]Representing Token Lengths

We find layer 0 features that activate as a function of the character count of individual tokens.

These features are overlapping (e.g., there are tokens for which the long word and medium word features are both active) and non-exhaustive (none of them fire on some common tokens, where we suspect the representation of character length is partially absorbed [79] into features which just activate for that token).

[139]The Mechanics of Head Specialization

Heads collaborate to generate the count manifold, but how does each head aggregate counts?
As a toy model, consider the following construction for character counting with a single attention head:

* The head uses the previous newline token as a "sink" where it places all of its attention by default (i.e., attention 1).
* Each token since the newline gets $\alpha$ attention, such that after $j$ tokens the newline has $1 - \alpha j$ attention. Note this limits the construction to lines of at most $1/\alpha$ tokens, but the model could use multiple heads at different offsets to count over longer sequences. The model could also dedicate attention to tokens in proportion to token length.

The output of the head on the newline is 0, and the output of each non-newline token is a vector with the same direction but magnitude proportional to the character count of the token. This produces a ray with total length proportional to the character count of the line (see the sketch below).

In practice, we observe that individual attention heads do indeed use the newline as an attention sink, but at different offsets. As an example, we visualize the attention patterns of 4 important Layer 0 heads on several prompts with different line widths (starting from the first newline in the sequence).

Attention patterns of four important Layer 0 heads (columns) for 3 different prompts (rows), showing head specialization. Patterns start from the first newline, with red dashes indicating linebreaks.

To characterize the mechanism more precisely, we compute the average attention as a function of the number of tokens since the previous newline, and also as a function of the character length of individual tokens.

Normalized attention as a function of tokens since newline (left) and token character length (right) for four heads. Each head specializes in a different offset, similar to boundary heads.

Similar to boundary detection, individual attention heads specialize in particular offsets to tile the space. Moreover, we observe that most of these attention heads have a bias towards attending to longer tokens.

In addition to QK, a head can change its output based on the OV circuit [28]. We study this by analyzing the pairwise interaction of probes as mediated by the OV matrix.
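Before turning to that OV analysis, here is a minimal sketch of the single-head toy counter described above (our illustration of the construction, not the model's actual weights; the value of $\alpha$, the model dimension, and the write direction are arbitrary choices): with attention $\alpha$ on each non-newline token, the remainder on the newline sink, and per-token value vectors whose magnitude is the token's character count, the head's output is a ray whose length is $\alpha$ times the line's character count.

```python
import numpy as np

def toy_head_output(tokens, alpha=0.02, d_model=8):
    """Single-head toy counter: attention alpha per non-newline token, remainder on the
    previous-newline 'sink', which contributes a zero value vector."""
    direction = np.zeros(d_model)
    direction[0] = 1.0                          # fixed write direction for the count ray
    j = len(tokens)                             # tokens since the previous newline
    assert alpha * j <= 1.0, "construction only works for lines of up to 1/alpha tokens"
    # value vector of each non-newline token: the write direction scaled by its character count
    values = np.stack([len(t) * direction for t in tokens]) if j else np.zeros((0, d_model))
    attn_newline = 1.0 - alpha * j              # attention left on the newline sink
    head_out = alpha * values.sum(axis=0)       # the newline contributes 0
    return head_out, attn_newline

line = ["The", " quick", " brown", " fox"]      # tokens since the last newline
out, a_nl = toy_head_output(line)
print(out[0], sum(len(t) for t in line))        # output magnitude is alpha * (line character count)
```

The printed magnitude grows linearly with the number of characters on the line, matching the ray picture above.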
For the OV analysis, we take the averaged token embedding vectors for each token length $E_t$ ^29 (that is, for each token character length $i$, we compute the average embedding vector in $W_E$; we also prepend the newline embedding vector to make the plot below), our line length probes $P_c$, and the weight matrices for the attention output of each head $W_{OV}$, and compute $P_c^\top W_{OV} E_t$.

Inner product of line character count and token character count through OV for 4 important layer 0 heads.

The OV responses reflect the difference in attention pattern biases. The output of each head can be thought of as having two components: (1) a character offset from the newline driven by the attention pattern, and (2) an adjustment based on the actual character length of the tokens. Note that the average character count of a token is approximately 4.5 (and the median is 4), so we can interpret these effects as shifts relative to a mean response (i.e., the transition point is always around count 4).

To walk through a head in action, consider the perspective of L0H1, which attends to the newline for the first ~4 tokens and then spreads out attention over the previous ~4–8 tokens:

* While attending to the newline, L0H1 writes to the 5–20 character count (5–20CC) directions and suppresses the 30–80CC directions. This makes sense as an approximation because attending to the newline implies that the line is currently at most 3 tokens long (the newline has no width), and on average any 3-token span is ~15 characters (and is unlikely to be >30).
* While not attending to the newline at all, L0H1 defaults to predicting CC40, since not attending to the newline implies that there are ~8 tokens in the line with ~5 characters each on average (including spaces).
Then, there is an additional correction applied depending on how long the tokens are:

* If the tokens being attended to are short (<4 chars), upweight 10–35CC and downweight >40CC.
* If the tokens being attended to are long (≥5 chars), do the opposite.
* In cases with some attention on newlines and some on non-newlines, linearly interpolate the above predictions.

Other heads perform a similar operation, except with different offsets depending on their newline sink behavior. Layer 1 heads also perform a similar operation, though they can additionally leverage the character count estimate of the Layer 0 heads (see [141]Layer 1 Head OVs).

[142]Layer 1 Head OVs

Similar to the OVs of the Layer 0 attention heads, Layer 1 heads write to the character count features in accordance with how long the tokens they attend to are.

However, in addition to the token character length, Layer 1 heads also use the initial line length estimate constructed in Layer 0 to create a more refined estimate of the character count.

These repeated computations appear responsible for implementing the sharpening of representations.

[143]Full Layer 0 Attention Results

Below, we show head sums for 3 different prompts with different line widths.

As before, we can look at their decomposition.

[144]More Sensory and Counting Representations

While in this work we carefully studied the perception of line lengths and fixed-width text, there are many tasks language models must perform that benefit from a positional, visual, or spatial representation of the underlying text. In the course of our investigation, we came across several other feature families and representations for these behaviors, and we report several below.

What Follows an Early Linebreak?

In addition to the line width features tracking the absolute character length of a full line of text, there also exist features that are sensitive to lines which have ended early (i.e., lines where the character count is substantially shorter than the line width $k$). While these features are less useful for linebreaking, they enable the model to better predict the token following a linebreak. Specifically, if a line ends $c$ characters before the line limit $k$, the next word should be at least $c$ characters long; otherwise it would have fit on the previous line.

A feature family for how many characters were remaining in a line after it was broken.
It is worth emphasizing that the role of these features, like others in this work, is not obvious from a typical workflow of quickly looking at dataset examples. It might be tempting to ignore these as "newline" features, but careful analysis yields quite clear behavior.

Markdown Table Representations

In addition to prose, language models must parse other kinds of more structured data, like tables. Accurate prediction of a table's content requires careful integration of row and column information (e.g., is this a column of text or numbers?). To facilitate this, we use a synthetic dataset of 20 markdown tables to find feature families which activate on separator tokens, specialized to particular rows or columns. Visualizing feature activations on each of these 20 tables (arranged by location in the table) showed clear patterns.

A feature family for the row index in markdown tables. Activations are shown for 20 tables on the "|" token in the first column of the nth row.

A feature family for the column index in markdown tables. Activations are shown for 20 tables on the "|" tokens in the nth column.

On a synthetic dataset of larger tables, we also observe counting representations for the column and row index that resemble the character counting representations. Specifically, we see ringing in the pairwise probe cosine similarities and the characteristic "baseball seam" in the PCA basis.

Representations of markdown table row indices (left) and column indices (right). (Top) pairwise inner products of probes trained to predict the index; (bottom) probes projected into a 3D PCA basis.

[145]Rejected Titles

* A General Language Assistant as a Laboratory for A-line-ment
* A-line-ment Science: The Geometry of Textual Perception
* The Geometry of Textual Perception: How Models Stay Aligned
* The Geometry of Counting: How Transformers Perceive and Manipulate Spatial Structure in Text
* The Mechanistic Basis of Alignment
* Reading between the lines: The Perception of Linebreaks
* Linebreaking: More than you wanted to know
* The Line Must Be Drawn Here! Character Counting in Neural Networks
* Newline, Who Dis? Attention Heads and Their Distractible Nature
* The End of the Line: How Language Models Count Characters
* How I Learned to Stop Worrying and Love the Carriage Return
* Breaking down the Linebreak Mechanism: The Geometry of Text Perception
* We Found a GPS Inside a GPT

Footnotes

1. All features have a magnitude dimension; so a discrete feature is a one-dimensional ray, and a one-dimensional feature manifold is the set of all scalings of that manifold, contracting to the origin. See [146]What is a Linear Representation? What is a Multidimensional Feature?
2. Michaud et al. [5] looked for "quanta" of model skills by clustering gradients. Their Figure 1 shows that predicting newlines in fixed-width text formed one of the top 400 clusters for the smallest model in the Pythia family, with 70m parameters.
3. The wrapping constraint is implicit. Each newline gives a lower bound (the previous word did fit) and an upper bound (the next word did not).
We do not nail down the extent to which the model performs optimal inference with respect to those constraints, focusing instead on how it approximately uses the length of each preceding line to determine whether to break the next. There are also many edge cases for handling tokenization and punctuation. A model could even attempt to infer whether the source document used a non-monospace font and then use the pixel count rather than the character count as a predictive signal!
4. We actually first tried to use patching and probing without looking at the graph, as a kind of methodological test of the utility of features, but did not make much progress. In hindsight, we were training probes for quantities different from the ones the model represents cleanly, e.g., a fusion of the current token position and the line width.
5. Ringing, in the manifold perspective, corresponds to interference in the feature superposition perspective.
6. Orthogonal dimensions would also not be robust to estimation noise.
7. Each feature has an encoder, which acts as a linear + (Jump)ReLU probe on the residual stream, and a decoder. Ten features $f_1,\ldots,f_{10}$ are associated with line character count. The model's estimate of the character count, given a residual stream vector $x$, is summarized by the set of activities of each of the 10 features $\{f_i(x)\}$.
8. The model's estimate of the character count is summarized by the projection $\pi(x)$ of $x$ onto that subspace. Two datapoints have similar character counts if their projections are close in that subspace.
9. The model's estimate of the character count is summarized by the nearest point on the manifold to the projection of $x$ into the subspace, and its confidence in that estimate by the magnitude of $\pi(x)$.
10. The model's estimate of the character count is summarized by the probability distribution given by the softmax of the probe activities, $\mathrm{softmax}(Px)$.
11. Note that, in general, one should not assume that a subspace spanned by features (or a PCA) is dedicated to those features, because it could be in superposition with many other features.
However, because in this case the character count subspace is densely active (and therefore less amenable to being in superposition), this experimental design is more justified.
12. The attribution graph has several positional features and edges on both the last token ("called") and the second-to-last token ("also"). We change the "also" count representation to be 6 characters less than that used for the final token, to maintain consistency.
13. As a 150-way multiclass classification problem.
14. We use the term "[161]ringing" in the sense of signal processing: a transient oscillation in response to a sharp peak, as in the Gibbs phenomenon.
15. The simulation can sometimes find itself in local minima. Increasing the width of the attractive zone before decreasing it again usually solves this issue.
16. Optimization in dimension 3, unlike in higher dimensions, admits bad local minima, because a generic curve on the surface of a sphere self-intersects. To avoid this, either increase the zone width until you get a great circle and then decrease it, or do the optimization in 4D and then select 3D.
17. Specifically, we multiply the line width probes through $W_K$ and the character count probes through $W_Q$, and plot the points in the 3D PCA basis of their joint embedding.
18. This algorithm also generalizes to arbitrary kinds of separators (e.g., double newlines or pipes), as the QK circuit can handle the positional offset independently of the OV circuit copying the separator type.
19. There are also multiple sets of boundary heads at multiple layers that usually come in sets of ~3 with similar relative offsets (so not actually "stereo").
20. Influence in the sense of influence on the logit node, as defined in Ameisen et al. [6].
21. These features also sometimes activate on zero-width modifier tokens (e.g., a token which indicates that the first letter of the following token should be capitalized) that need to be adjacent to the modified token, when the modified token is sufficiently long to go over the line limit (e.g., for "Aluminum" instead of "aluminum").
22. We use the true next non-newline token as the label. This is an approximation because it assumes that the model perfectly predicts the next token.
23. This sum is principled because both sets of vectors are marginalized data means, so collectively they have the mean of the data, which we center to be 0.
24. We display the average outputs over many prompts.
25. The prediction is the argmax of the head outputs projected onto the character count probes.
26. We omit a previous-token head for visual presentation.
27. Tokens do not come annotated with character counts, and there are no vertical bars on the page showing the line width.
28. The entire continuous circle embeds into the infinite-dimensional Hilbert space $L^2(\mathbb{S}^1)$ via this construction.
29. That is, for each token character length $i$, we compute the average embedding vector in $W_E$. We also prepend this with the newline embedding vector to make the plot below.

References

1. Feature Manifold Toy Model  [179][link] Olah, C. and Batson, J., 2023. 2. What is a Linear Representation? What is a Multidimensional Feature?  [180][link] Olah, C., 2024. 3. Curve Detector Manifolds in InceptionV1  [181][link] Gorton, L., 2024. 4. Not All Language Model Features Are One-Dimensionally Linear  [182][link] Engels, J., Michaud, E.J., Liao, I., Gurnee, W. and Tegmark, M., 2025. The Thirteenth International Conference on Learning Representations. 5. The Quantization Model of Neural Scaling  [183][link] Michaud, E.J., Liu, Z., Girit, U. and Tegmark, M., 2023. Thirty-seventh Conference on Neural Information Processing Systems. 6. Circuit Tracing: Revealing Computational Graphs in Language Models  [184][HTML] Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N.L., Chen, B., Citro, C., Abrahams, D., Carter, S., Hosmer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Ben Thompson, T., Zimmerman, S., Rivoire, K., Conerly, T., Olah, C. and Batson, J., 2025. Transformer Circuits. 7. The Origins of Representation Manifolds in Large Language Models Modell, A., Rubin-Delanchy, P. and Whiteley, N., 2025. arXiv preprint arXiv:2505.18235. 8. From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit Costa, V., Fel, T., Lubana, E.S., Tolooshams, B. and Ba, D., 2025. arXiv preprint arXiv:2506.03093. 9. Interpretability Dreams  [185][HTML] Olah, C., 2023. 10. A structural probe for finding syntax in word representations  [186][PDF] Hewitt, J. and Manning, C.D., 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129--4138. [187]DOI: 10.18653/v1/N19-1419 11. Visualizing and measuring the geometry of BERT  [188][PDF] Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Viégas, F. and Wattenberg, M., 2019. Advances in Neural Information Processing Systems, Vol 32. 12. The geometry of multilingual language model representations Chang, T.A., Tu, Z. and Bergen, B.K., 2022. arXiv preprint arXiv:2205.10964. 13.
Relational composition in neural networks: A survey and call to action Wattenberg, M. and Viegas, F.B., 2024. arXiv preprint arXiv:2407.14662. 14. The geometry of categorical and hierarchical concepts in large language models Park, K., Choe, Y.J., Jiang, Y. and Veitch, V., 2024. arXiv preprint arXiv:2406.01506. 15. The geometry of concepts: Sparse autoencoder feature structure Li, Y., Michaud, E.J., Baek, D.D., Engels, J., Sun, X. and Tegmark, M., 2025. Entropy, Vol 27(4), pp. 344. MDPI. 16. The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence  [189][link] Wollschlager, T., Elstner, J., Geisler, S., Cohen-Addad, V., Gunnemann, S. and Gasteiger, J., 2025. arXiv preprint arXiv:2502.17420. 17. Projecting assumptions: The duality between sparse autoencoders and concept geometry Hindupur, S.S.R., Lubana, E.S., Fel, T. and Ba, D., 2025. arXiv preprint arXiv:2503.01822. 18. Sparse Crosscoders for Cross-Layer Features and Model Diffing  [190][HTML] Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J. and Olah, C., 2024. 19. Curve Circuits  [191][link] Cammarata, N., Goh, G., Carter, S., Voss, C., Schubert, L. and Olah, C., 2021. Distill. 20. The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision  [192][link] Gorton, L., 2024. arXiv preprint arXiv:2406.03662. 21. Place cells, grid cells, and the brain's spatial representation system.  [193][link] Moser, E.I., Kropff, E. and Moser, M., 2008. Annual review of neuroscience, Vol 31, pp. 69-89. 22. The neural basis of the Weber--Fechner law: a logarithmic mental number line Dehaene, S., 2003. Trends in cognitive sciences, Vol 7(4), pp. 145--147. Elsevier. 23. Tuning curves for approximate numerosity in the human intraparietal sulcus Piazza, M., Izard, V., Pinel, P., Le Bihan, D. and Dehaene, S., 2004. Neuron, Vol 44(3), pp. 547--555. Elsevier. 24. A Toy Model of Interference Weights  [194][HTML] Olah, C., Turner, N.L. and Conerly, T., 2025. 25. {GPT-2}'s positional embedding matrix is a helix  [195][link] Yedidia, A., 2023. 26. The positional embedding matrix and previous-token heads: how do they actually work?  [196][link] Yedidia, A., 2023. Alignment Forum. 27. Representation of geometric borders in the entorhinal cortex Solstad, T., Boccara, C.N., Kropff, E., Moser, M. and Moser, E.I., 2008. Science, Vol 322(5909), pp. 1865--1868. American Association for the Advancement of Science. 28. A Mathematical Framework for Transformer Circuits  [197][HTML] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2021. Transformer Circuits Thread. 29. The Muller-Lyer illusion explained by the statistics of image--source relationships Howe, C.Q. and Purves, D., 2005. Proceedings of the National Academy of Sciences, Vol 102(4), pp. 1234--1239. National Academy of Sciences. 30. A review on various explanations of Ponzo-like illusions Yildiz, G.Y., Sperandio, I., Kettle, C. and Chouinard, P.A., 2022. Psychonomic Bulletin \& Review, Vol 29(2), pp. 293--320. Springer. 31. Space and time in visual context Schwartz, O., Hsu, A. and Dayan, P., 2007. Nature Reviews Neuroscience, Vol 8(7), pp. 522--535. Nature Publishing Group UK London. 32. 
On the Biology of a Large Language Model  [198][HTML] Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N.L., Citro, C., Abrahams, D., Carter, S., Hosmer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Thompson, T.B., Zimmerman, S., Rivoire, K., Conerly, T., Olah, C. and Batson, J., 2025. Transformer Circuits Thread. 33. A primer in bertology: What we know about how bert works  [199][link] Rogers, A., Kovaleva, O. and Rumshisky, A., 2020. Transactions of the Association for Computational Linguistics, Vol 8, pp. 842--866. MIT Press. [200]DOI: 10.1162/tacl_a_00349 34. Zoom In: An Introduction to Circuits  [201][link] Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. [202]DOI: 10.23915/distill.00024.001 35. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small  [203][link] Wang, K., Variengien, A., Conmy, A., Shlegeris, B. and Steinhardt, J., 2022. arXiv preprint arXiv:2211.00593. 36. Progress measures for grokking via mechanistic interpretability  [204][link] Nanda, N., Chan, L., Lieberum, T., Smith, J. and Steinhardt, J., 2023. arXiv preprint arXiv:2301.05217. 37. (How) Do Language Models Track State? Li, B.Z., Guo, Z.C. and Andreas, J., 2025. arXiv preprint arXiv:2503.02854. 38. Automatically identifying local and global circuits with linear computation graphs  [205][link] Ge, X., Zhu, F., Shu, W., Wang, J., He, Z. and Qiu, X., 2024. arXiv preprint arXiv:2405.13868. 39. Transcoders find interpretable LLM feature circuits  [206][PDF] Dunefsky, J., Chlenski, P. and Nanda, N., 2025. Advances in Neural Information Processing Systems, Vol 37, pp. 24375--24410. 40. Tracing Attention Computation Through Feature Interactions  [207][HTML] Kamath, H., Ameisen, E., Kauvar, I., Luger, R., Gurnee, W., Pearce, A., Zimmerman, S., Batson, J., Conerly, T., Olah, C. and Lindsey, J., 2025. Transformer Circuits Thread. 41. Neurons in large language models: Dead, n-gram, positional Voita, E., Ferrando, J. and Nalmpantis, C., 2023. arXiv preprint arXiv:2309.04827. 42. Understanding positional features in layer 0 {SAE}s  [208][link] Chughtai, B. and Lau, Y., 2024. 43. Universal neurons in gpt2 language models  [209][link] Gurnee, W., Horsley, T., Guo, Z.C., Kheirkhah, T.R., Sun, Q., Hathaway, W., Nanda, N. and Bertsimas, D., 2024. arXiv preprint arXiv:2401.12181. 44. Why neural translations are the right length Shi, X., Knight, K. and Yuret, D., 2016. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2278--2282. 45. Length Representations in Large Language Models Moon, S., Choi, D., Kwon, J., Kamigaito, H. and Okumura, M., 2025. arXiv preprint arXiv:2507.20398. 46. LSTM networks can perform dynamic counting Suzgun, M., Gehrmann, S., Belinkov, Y. and Shieber, S.M., 2019. arXiv preprint arXiv:1906.03648. 47. Language models need inductive biases to count inductively Chang, Y. and Bisk, Y., 2024. arXiv preprint arXiv:2405.20131. 48. The clock and the pizza: Two stories in mechanistic explanation of neural networks  [210][PDF] Zhong, Z., Liu, Z., Tegmark, M. and Andreas, J., 2023. Advances in neural information processing systems, Vol 36, pp. 27223--27250. 49. Feature emergence via margin maximization: case studies in algebraic tasks Morwani, D., Edelman, B.L., Oncescu, C., Zhao, R. and Kakade, S., 2023. arXiv preprint arXiv:2311.07568. 50. 
A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis  [211][link] Stolfo, A., Belinkov, Y. and Sachan, M., 2023. arXiv preprint arXiv:2305.15054. 51. Pre-trained large language models use fourier features to compute addition  [212][link] Zhou, T., Fu, D., Sharan, V. and Jia, R., 2024. arXiv preprint arXiv:2406.03445. 52. Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics  [213][link] Nikankin, Y., Reusch, A., Mueller, A. and Belinkov, Y., 2024. 53. Language Models Use Trigonometry to Do Addition  [214][link] Kantamneni, S. and Tegmark, M., 2025. 54. Understanding In-context Learning of Addition via Activation Subspaces Hu, X., Yin, K., Jordan, M.I., Steinhardt, J. and Chen, L., 2025. arXiv preprint arXiv:2505.05145. 55. Number Representations in LLMs: A Computational Parallel to Human Perception AlquBoj, H., AlQuabeh, H., Bojkovic, V., Hiraoka, T., El-Shangiti, A.O., Nwadike, M. and Inui, K., 2025. arXiv preprint arXiv:2502.16147. 56. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model  [215][PDF] Hanna, M., Liu, O. and Variengien, A., 2023. Advances in Neural Information Processing Systems, Vol 36, pp. 76033--76060. 57. Successor Heads: Recurring, Interpretable Attention Heads In The Wild  [216][link] Gould, R., Ong, E., Ogden, G. and Conmy, A., 2023. 58. Curve Detectors  [217][link] Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill. 59. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets  [218][link] Marks, S. and Tegmark, M., 2023. arXiv preprint arXiv:2310.06824. 60. How do language models bind entities in context?  [219][link] Feng, J. and Steinhardt, J., 2023. arXiv preprint arXiv:2310.17191. 61. Understanding sparse autoencoder scaling in the presence of feature manifolds  [220][PDF] Michaud, E.J., Gorton, L. and McGrath, T., 2025. 62. Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning Huang, X. and Hahn, M., 2025. arXiv preprint arXiv:2508.01916. 63. Monotonic representation of numeric properties in language models Heinzerling, B. and Inui, K., 2024. arXiv preprint arXiv:2403.10381. 64. Language Models Represent Space and Time  [221][link] Gurnee, W. and Tegmark, M., 2024. 65. A neural manifold view of the brain Perich, M.G., Narain, D. and Gallego, J.A., 2025. Nature Neuroscience, pp. 1--16. Nature Publishing Group US New York. 66. Position: An inner interpretability framework for AI inspired by lessons from cognitive neuroscience Vilas, M.G., Adolfi, F., Poeppel, D. and Roig, G., 2024. arXiv preprint arXiv:2406.01352. 67. Multilevel interpretability of artificial neural networks: leveraging framework and methods from neuroscience He, Z., Achterberg, J., Collins, K., Nejad, K., Akarca, D., Yang, Y., Gurnee, W., Sucholutsky, I., Tang, Y., Ianov, R. and others,, 2024. arXiv preprint arXiv:2408.12664. 68. Cognitively Inspired Interpretability in Large Neural Networks Leshinskaya, A., Webb, T., Pavlick, E., Feng, J., Opielka, G., Stevenson, C. and Blank, I.A., 2025. Proceedings of the Annual Meeting of the Cognitive Science Society, Vol 47. 69. 
Softmax Linear Units  [222][HTML] Elhage, N., Hume, T., Olsson, C., Nanda, N., Henighan, T., Johnston, S., ElShowk, S., Joseph, N., DasSarma, N., Mann, B., Hernandez, D., Askell, A., Ndousse, K., Jones, A., Drain, D., Chen, A., Bai, Y., Ganguli, D., Lovitt, L., Hatfield-Dodds, Z., Kernion, J., Conerly, T., Kravec, S., Fort, S., Kadavath, S., Jacobson, J., Tran-Johnson, E., Kaplan, J., Clark, J., Brown, T., McCandlish, S., Amodei, D. and Olah, C., 2022. Transformer Circuits Thread. 70. Finding Neurons in a Haystack: Case Studies with Sparse Probing  [223][link] Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D. and Bertsimas, D., 2023. arXiv preprint arXiv:2305.01610. 71. Information flow routes: Automatically interpreting language models at scale Ferrando, J. and Voita, E., 2024. arXiv preprint arXiv:2403.00824. 72. The remarkable robustness of llms: Stages of inference? Lad, V., Lee, J.H., Gurnee, W. and Tegmark, M., 2024. arXiv preprint arXiv:2406.19384. 73. Beyond the doors of perception: Vision transformers represent relations between objects Lepori, M., Tartaglini, A., Vong, W.K., Serre, T., Lake, B.M. and Pavlick, E., 2024. Advances in Neural Information Processing Systems, Vol 37, pp. 131503--131544. 74. Language models can explain neurons in language models  [224][HTML] Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J. and Saunders, W., 2023. 75. Automatically interpreting millions of features in large language models Paulo, G., Mallen, A., Juang, C. and Belrose, N., 2024. arXiv preprint arXiv:2410.13928. 76. Enhancing automated interpretability with output-centric feature descriptions Gur-Arieh, Y., Mayan, R., Agassy, C., Geiger, A. and Geva, M., 2025. arXiv preprint arXiv:2501.08319. 77. A multimodal automated interpretability agent Shaham, T.R., Schwettmann, S., Wang, F., Rajaram, A., Hernandez, E., Andreas, J. and Torralba, A., 2024. Forty-first International Conference on Machine Learning. 78. Building and evaluating alignment auditing agents Bricken, T., Wang, R., Bowman, S., Ong, E., Treutlein, J., Wu, J., Hubinger, E. and Marks, S., 2025. 79. A is for absorption: Studying feature splitting and absorption in sparse autoencoders  [225][link] Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H. and Bloom, J., 2024. arXiv preprint arXiv:2409.14507.
https://arxiv.org/pdf/2409.14507 Hidden links: 227. https://anthropic.com/
