# Section 2: The Semantic Smearing Problem
<!-- Registry references: RAG-001–007, EXT-001–010 -->
<!-- Citation files: ethayarajh_2019_anisotropy.md, semantic_smearing_evidence.md, stochastic_tax_framing.md -->
## 2.1 Anisotropy in Domain-Homogeneous Corpora
Large language models represent text as vectors in high-dimensional embedding spaces, where semantic similarity corresponds to geometric proximity. This representation is effective when the concepts being compared occupy distinct regions of the space. However, Ethayarajh (2019) demonstrated that contextual word representations from models such as BERT, ELMo, and GPT-2 exhibit high anisotropy — the representations occupy a narrow cone in the vector space rather than being uniformly distributed across all directions. In the upper layers of GPT-2, the average cosine similarity between randomly sampled word representations approaches 0.99, meaning that even unrelated concepts are geometrically close.
This property has particular consequences for domain-specific corpora where the vocabulary, sentence structure, and conceptual framing are inherently homogeneous. Federal statistical metadata is an extreme case. Census variable descriptions share a common vocabulary of demographic terms, geographic references, and survey methodology language. A variable measuring median household income in a county and a variable measuring per capita income in a metropolitan statistical area use many of the same words in similar syntactic patterns to describe related but distinct measurements. In embedding space, these descriptions cluster tightly — not because they mean the same thing, but because the representational geometry cannot separate them.
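The anisotropy diagnostic Ethayarajh uses can be sketched as a mean pairwise cosine similarity probe. The toy vectors below are hypothetical stand-ins for sampled contextual representations, not real model output:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(vectors):
    """Anisotropy probe: average cosine similarity over all pairs of
    sampled representations. Values near 1.0 indicate the vectors
    occupy a narrow cone rather than spreading across the space."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy illustration: vectors confined to a narrow cone score near 1.0;
# orthogonal vectors score 0.0.
cone = [[1.0, 0.05], [1.0, -0.05], [1.0, 0.0]]
axes = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_similarity(cone))  # near 1.0
print(mean_pairwise_similarity(axes))  # 0.0
```

A reading near 0.99 on randomly sampled representations, as in GPT-2's upper layers, means the "cone" case above is the norm rather than the pathology.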
## 2.2 Empirical Evidence: The Enrichment Experiment
We tested this directly using a matched-pairs analysis of 2,500 Census variable descriptions across two embedding models. The experiment compared three representations of each variable: the raw Census label, the label combined with its concept metadata, and an LLM-enriched description incorporating full contextual text generated by a language model.
> **[INSERT FIGURE F2: Semantic smearing — enrichment experiment results showing similarity increase and discrimination collapse for MiniLM-384 and RoBERTa-1024]**
For the all-MiniLM-L6-v2 model (384 dimensions), mean pairwise cosine similarity increased from 0.4297 for raw metadata to 0.6271 for enriched descriptions — a 45.9% increase. More critically, group discrimination — the model's ability to distinguish between variables from different conceptual groups — collapsed by 63.7%. The enrichment process, intended to improve retrieval by adding richer semantic context, instead homogenized the embedding space by introducing shared domain language across all descriptions.
The effect was more severe with a larger model. RoBERTa-large (1,024 dimensions) showed an 82.2% increase in mean similarity and an 86.5% collapse in discrimination. Higher dimensionality did not resolve the problem; it amplified it by capturing more of the shared domain signal that was already saturating the space.
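The mechanism behind the discrimination collapse can be sketched with a hypothetical stand-in for the experiment's metric: the gap between within-group and between-group cosine similarity. The embeddings, group labels, and `enrich` step below are all contrived for illustration; "enrichment" is modeled as every vector absorbing the same shared "domain boilerplate" direction:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(a * a for a in v)))

def discrimination_gap(groups):
    """Mean within-group minus mean between-group cosine similarity.
    A gap near zero means conceptual groups are indistinguishable."""
    within, between = [], []
    for gi, g in enumerate(groups):
        for i, u in enumerate(g):
            for j in range(i + 1, len(g)):
                within.append(cosine(u, g[j]))
        for h in groups[gi + 1:]:
            for u in g:
                for v in h:
                    between.append(cosine(u, v))
    return sum(within) / len(within) - sum(between) / len(between)

def enrich(vec, weight=2.0):
    """Mimic enrichment: every vector picks up the same shared
    'domain boilerplate' component, pulling all of them together."""
    return [a + weight for a in vec]

# Two toy conceptual groups (hypothetical embeddings).
groups = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],   # e.g. income variables
    [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],   # e.g. housing variables
]
raw = discrimination_gap(groups)
smeared = discrimination_gap([[enrich(v) for v in g] for g in groups])
assert smeared < raw  # shared vocabulary collapses group separation
```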
This finding has a direct implication: the problem is not in the embedding model. It is in the text. Census methodology documentation uses a constrained vocabulary to describe a large number of related but distinct statistical products. Any embedding model operating on this text will produce representations that cluster in a narrow region of the space, because the text itself provides insufficient signal for geometric separation. Adding more text — enriching, expanding, paraphrasing — makes the problem worse by introducing additional shared vocabulary.
We describe this phenomenon as *semantic smearing*: the representations of concepts that should remain distinct are smeared together across the embedding space, making retrieval systems unable to discriminate between them. The metaphor is not a needle in a haystack. It is a needle in a haystack of needles.
## 2.3 Consequences for Retrieval-Based Approaches
Semantic smearing explains why retrieval-augmented generation underperforms expectations in federal statistical domains. Standard RAG systems retrieve document chunks by embedding the user's query and finding the nearest neighbors in the indexed corpus. When the corpus exhibits high anisotropy and domain homogeneity, the nearest neighbors are likely to be semantically adjacent but contextually wrong — a chunk about poverty thresholds when the query concerns poverty rates, or a passage about one-year estimates when the question requires five-year methodology.
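The failure mode can be made concrete with a minimal top-k retriever over hand-written toy embeddings (the chunk names and 2-d vectors below are illustrative, not real model output). In a smeared space, the chunk about poverty thresholds outranks the chunk that actually answers a question about poverty rates:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(a * a for a in v)))

def top_k(query, corpus, k=1):
    """Standard RAG retrieval step: rank indexed chunks by cosine
    similarity to the query embedding and return the k nearest."""
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: every chunk crowds the same narrow cone, so the
# semantically adjacent but contextually wrong chunk wins.
corpus = {
    "poverty_thresholds": [0.98, 0.17],
    "poverty_rates":      [0.95, 0.05],
    "median_income":      [0.90, 0.40],
}
query_about_poverty_rates = [0.99, 0.14]
print(top_k(query_about_poverty_rates, corpus))  # ['poverty_thresholds']
```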
GraphRAG systems attempt to address this by augmenting vector retrieval with graph structure, traversing relationships between entities to provide richer context. However, GraphRAG incurs substantially higher infrastructure costs — approximately twice the monthly operating expense of standard RAG for comparable workloads — while retrieving significantly more tokens per query (approximately 47,000 versus 3,700 for top-5 RAG) without proportional quality gains on domain-specific tasks. The additional graph infrastructure adds complexity and maintenance burden without addressing the fundamental problem: the embedding space cannot discriminate in a domain where all the content sounds alike.
Both approaches also introduce stochastic variance into the grounding process. Embedding-based retrieval is inherently approximate — the same query can return different chunks depending on model version, index state, and the numerical precision of similarity computations. This stochastic retrieval compounds with the stochastic nature of language model generation, producing variance at two stages of the pipeline. In domains where precision matters — where the difference between a one-year and five-year estimate, or between a 20% and 40% coefficient of variation, determines whether an answer is useful or harmful — this compounding variance is not a theoretical concern. It is a practical failure mode.
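The retrieval-side variance can be illustrated deterministically: re-embedding the same corpus under a different model version can flip which chunk is nearest, even though the query and corpus text never changed. All vectors below are contrived for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(a * a for a in v)))

def top_1(query, corpus):
    """Return the name of the single nearest chunk."""
    return max(corpus, key=lambda name: cosine(query, corpus[name]))

query = [1.0, 0.05]

# The same two chunks embedded under two "model versions": drift in
# the stored vectors flips the nearest neighbor for an unchanged query.
index_v1 = {"chunk_A": [1.0, 0.00], "chunk_B": [1.0, 0.30]}
index_v2 = {"chunk_A": [1.0, 0.30], "chunk_B": [1.0, 0.05]}

print(top_1(query, index_v1))  # chunk_A
print(top_1(query, index_v2))  # chunk_B
```

The grounding a chunk provides therefore depends on index state, and any downstream generation variance multiplies on top of it.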
## 2.4 The Judgment Gap
The semantic smearing problem reveals that the challenge facing AI systems in statistical domains is not primarily one of retrieval. Language models already perform the syntactic and semantic tasks — translating natural language into domain-appropriate API calls, identifying relevant variables, resolving geographic entities — with sufficient accuracy for practical use. The control condition in our evaluation demonstrates this: models successfully retrieve correct data from the Census API in the majority of cases without any retrieval augmentation.
What models cannot do reliably is assess the fitness of the data they retrieve. They do not know when a margin of error renders an estimate unreliable, when a geographic nesting assumption does not hold, when a period estimate should not be compared to a point-in-time figure, or when the appropriate response is to decline to provide a number rather than deliver it with false confidence. This is not information that can be retrieved from a document chunk. It is expert judgment about appropriate use — judgment that is formed through professional practice, accumulated through experience with the data and its limitations, and rarely stated explicitly in any single passage of any methodology handbook.
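One small, mechanical slice of that judgment can nonetheless be written down explicitly. The sketch below derives a coefficient of variation from an ACS-style 90% margin of error (SE = MOE / 1.645) and gates on it; the 20%/40% cutoffs echo the example above and are illustrative thresholds, not an official Census rule:

```python
def coefficient_of_variation(estimate, moe_90):
    """CV as a percentage, deriving the standard error from a 90%
    margin of error (the ACS convention: SE = MOE / 1.645)."""
    se = moe_90 / 1.645
    return 100.0 * se / estimate

def reliability_flag(estimate, moe_90, caution=20.0, unreliable=40.0):
    """Illustrative fitness-for-use gate; cutoffs are hypothetical."""
    cv = coefficient_of_variation(estimate, moe_90)
    if cv > unreliable:
        return "suppress"   # decline to report the number
    if cv > caution:
        return "caution"    # report with an explicit warning
    return "ok"

print(reliability_flag(50_000, 1_500))  # ok       (CV ~ 1.8%)
print(reliability_flag(1_200, 600))     # caution  (CV ~ 30%)
print(reliability_flag(300, 400))       # suppress (CV ~ 81%)
```

A gate like this captures only the easiest, most codifiable case; the harder judgments — nesting assumptions, period-versus-point comparisons — resist reduction to a single formula, which is the point of the section.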
The gap is not in what the model knows. It is in what the model can judge. Filling this gap requires not better retrieval, but a different kind of intervention entirely.
## References
Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)*. https://arxiv.org/abs/1909.00512