
Code Execution with MCP: Architecting Agentic Efficiency



1. The Dual Problem of Context Bloat
   1. Tool Definition Bloat
   2. Tool Result Bloat
2. Code Execution and Agentic Control Flow
   1. Token-Efficient Data Transfer
   2. Progressive Disclosure of Tool Definitions
   3. Advanced Control Flow and Reusable Skills
3. Behind the Scenes: How It Works
   1. Architecture: Stub Generation and Sandboxing
   2. Tool Wiring and Harness Integration
4. My Thoughts
5. Acknowledgements
6. References

                      The Model Context Protocol (MCP) is an open standard for connecting AI systems to external data and tools. It has quickly become foundational for building complex agentic systems. At its core, MCP defines a client–server architecture and structured schemas that allow a Large Language Model (LLM) to interact with external services, referred to as tools, to retrieve information or perform actions. An agent is an LLM-driven application that orchestrates these tools to achieve a goal.

As agentic systems scale, a critical constraint emerges: the finite and costly nature of the model’s context window. Enterprise agents often connect to dozens of MCP servers that collectively expose thousands of tools. Serializing all tool definitions and intermediate results directly into the model context does not scale. This approach degrades performance, increases inference cost, and limits workflow complexity. Addressing this constraint requires a shift in how models interact with their environment, moving away from prompt-bound tool calls toward execution-driven control, as discussed in the referenced talk [1].

                      The Dual Problem of Context Bloat

                      Large-scale agent deployments suffer from two distinct but compounding forms of context bloat. One is static and front-loaded, while the other is dynamic and multiplicative. Together, they make traditional MCP usage increasingly impractical at scale.

                      1. Tool Definition Bloat

                      In a conventional MCP setup, the client must provide the LLM with the full action space upfront. This involves fetching and injecting definitions for all available tools, including names, descriptions, and detailed input and output schemas.

                      In an enterprise environment with twenty MCP servers exposing twenty tools each, the client must serialize four hundred tool definitions. These schemas, often expressed in JSON Schema or similar high-fidelity formats, are not token-efficient. The cumulative overhead can consume a substantial portion of the available context window before the user query or reasoning even begins.

                      In production systems, this upfront serialization can exceed 100,000 tokens solely for tool definitions. The result is higher latency, increased cost, and reduced capacity for reasoning, conversational history, and user input. For models with smaller context windows, this becomes a hard blocker to adoption.
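To make that overhead concrete, here is a sketch of a single tool definition as a client might receive it from a tools/list call. The tool name and schema fields below are hypothetical, but the shape follows the MCP tool listing format; multiply something like this by four hundred and the budget disappears quickly.

// A hypothetical tool definition, shaped like one entry from an MCP
// tools/list response. Even this small example costs on the order of a
// hundred tokens; real schemas with nested objects cost far more.
const getDocumentTool = {
  name: "google_drive_get_document",
  description: "Retrieve a document's content and metadata from Google Drive.",
  inputSchema: {
    type: "object",
    properties: {
      documentId: { type: "string", description: "ID of the document to fetch." },
      fields: { type: "string", description: "Comma-separated list of fields to return." },
    },
    required: ["documentId"],
  },
};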

                      2. Tool Result Bloat

                      The second form of bloat appears during multi-step workflows that move large artifacts between tools. A common pattern looks like this:

                      1. Call Tool A to retrieve a large artifact, such as a meeting transcript.

                      2. Inject the entire artifact into the LLM’s context.

                      3. Instruct Tool B to consume that same artifact, requiring the model to reproduce the data inside a new tool call.

                      This process multiplies token usage. A single 50,000-token document can consume close to 100,000 tokens across retrieval, context injection, and reproduction. Latency and cost scale linearly with data size, making large dataset processing or parallel workflows effectively infeasible.
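Sketched as pseudocode, the naive flow looks like the following. Here callTool is a hypothetical stand-in for a single prompt-bound tool invocation, and the token counts are the illustrative figures from above.

// Hypothetical stand-in for one tool call routed through the model's context.
declare function callTool(name: string, args: Record<string, unknown>): Promise<any>;

async function naiveFlow() {
  // Step 1: Tool A returns the transcript as a tool result.
  // The entire artifact (~50,000 tokens) is injected into the context.
  const transcript = await callTool("google_drive_get_document", {
    documentId: "doc_12345",
  });

  // Step 2: to hand the artifact to Tool B, the model must re-emit it
  // token by token inside the new call (~50,000 tokens again).
  await callTool("salesforce_update_record", {
    recordId: "rec_9876",
    data: { meeting_notes: transcript.content },
  });
}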

                      Code Execution and Agentic Control Flow

                      To overcome these limitations, agent architectures are shifting from direct, prompt-based tool invocation to code execution–driven control flow. Modern LLMs are capable of generating reliable code, which allows them to treat the MCP ecosystem as a collection of local SDKs rather than remote tools described entirely in text.


                      In this model, the LLM generates code in a sandboxed, type-safe execution environment, such as TypeScript or Python. The generated code calls locally defined stubs that abstract MCP communication. This architectural shift delivers three critical benefits.

                      1. Token-Efficient Data Transfer

                      Large intermediate artifacts are stored and passed within the execution environment rather than being serialized into the LLM context. Data moves by reference through variables, and only small summaries or execution statuses are surfaced back to the model.

// Conceptual model-generated code
import { googleDrive } from "mcp-stubs/google-drive";
import { salesforce } from "mcp-stubs/salesforce";

const transcript = await googleDrive.getDocument({
  documentId: "doc_12345",
  fields: "content",
});

const result = await salesforce.updateRecord({
  objectType: "SalesRecord",
  recordId: "rec_9876",
  data: { meeting_notes: transcript.content.slice(0, 1000) },
});

console.log("Record Update Status:", result.status);

                      The raw document content never re-enters the LLM’s context. Only structured outputs or summaries are returned, eliminating Tool Result Bloat while preserving full programmatic control.

                      2. Progressive Disclosure of Tool Definitions

                      Tool Definition Bloat is addressed through progressive disclosure. Instead of injecting all tool schemas eagerly, the MCP tool surface is exposed as a browsable structure, such as a virtual file system or dependency tree. The agent inspects this structure and loads only the specific tool definitions it needs at runtime.

                      This approach replaces eager serialization with targeted retrieval. The LLM reasons over a small, relevant subset of tools rather than an entire enterprise catalog, significantly reducing context usage without sacrificing capability.
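As a rough sketch, assuming the client exposes generated stubs as a directory tree on the sandbox filesystem (the paths below are hypothetical), progressive disclosure can be as simple as listing names first and importing one module on demand:

// List server names only: a few dozen tokens instead of a full catalog.
import { readdir } from "node:fs/promises";

const servers = await readdir("./mcp-stubs");
console.log(servers); // e.g. ["google-drive", "salesforce", ...]

// Load the one definition the task actually needs, when it is needed.
const { googleDrive } = await import("./mcp-stubs/google-drive");
const doc = await googleDrive.getDocument({ documentId: "doc_12345" });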

                      3. Advanced Control Flow and Reusable Skills

Generating executable code allows agents to express complex control flow directly, as the sketch after this list illustrates:

                      • Loops and parallelism: Standard constructs such as loops or parallel execution enable efficient batch processing and polling workflows.

                      • Conditional logic: Branching based on intermediate results can be handled in code, avoiding repeated reasoning cycles in the LLM context.

                      • Reusable skills: Successful code patterns can be stored and reused as higher-level skills, forming a growing library of compound behaviors authored by the agent itself.
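A brief sketch of these constructs working together, reusing the hypothetical stubs from the earlier example:

import { googleDrive } from "mcp-stubs/google-drive";
import { salesforce } from "mcp-stubs/salesforce";

const documentIds = ["doc_1", "doc_2", "doc_3"];

// Parallelism: fetch every artifact concurrently inside the sandbox.
const docs = await Promise.all(
  documentIds.map((id) =>
    googleDrive.getDocument({ documentId: id, fields: "content" })
  )
);

// Conditional logic: branch on intermediate results without
// spending a reasoning cycle in the model's context.
for (const [i, doc] of docs.entries()) {
  if (doc.content.trim().length > 0) {
    await salesforce.updateRecord({
      objectType: "SalesRecord",
      recordId: `rec_${i}`,
      data: { meeting_notes: doc.content.slice(0, 1000) },
    });
  }
}

// Only a compact summary re-enters the model's context.
console.log(`Processed ${docs.length} documents.`);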

                      Behind the Scenes: How It Works

                      Supporting code execution requires specific client-side architecture and a well-defined execution pipeline.

                      Architecture: Stub Generation and Sandboxing

                      The system relies on three core components:


1. Client-side stub generation: After discovering an MCP server, the client performs a tool listing call and deterministically generates typed stubs for each tool. These stubs are organized hierarchically so the LLM can explore them as needed. A sketch of the generated output follows this list.

                      2. Type-safety enforcement: Using a typed language enables validation of model-generated code against MCP schemas before execution. This pre-execution checking acts as a self-correction mechanism and improves reliability.

                      3. Secure execution environment: Generated code runs in an isolated sandbox with strict resource limits and no direct internet access. Each execution is short-lived and fully controlled.
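As an illustration of step 1, the generated stub for the hypothetical google-drive server from earlier might look like the following. The argument types would be derived from the tool's JSON Schema, and callMcpTool (covered in the next section) is the delegation helper; its assumed location is an implementation detail.

// mcp-stubs/google-drive/index.ts (hypothetical generated output)
import { callMcpTool } from "../harness"; // assumed location of the helper

// Argument types derived from the tool's inputSchema.
export interface GetDocumentArgs {
  documentId: string;
  fields?: string;
}

export const googleDrive = {
  getDocument: (args: GetDocumentArgs) =>
    callMcpTool("google-drive", "getDocument", args) as Promise<{ content: string }>,
};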

                      Tool Wiring and Harness Integration

                      The generated stubs delegate execution through a helper function, commonly named callMcpTool. This helper forwards requests to a centralized agent harness, which serves as the policy and trust anchor of the system.

                      The harness is responsible for applying authentication, enforcing enterprise policies, routing requests to external MCP servers, and returning results directly to the execution environment. Crucially, large results bypass the LLM context entirely. This design ensures that even arbitrary, model-generated code operates within strict security and governance boundaries.
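A minimal sketch of that helper, assuming the harness is reachable at an internal HTTP endpoint; the URL and request shape are assumptions for illustration, not a reference implementation.

// Every generated stub funnels through this single choke point.
export async function callMcpTool(
  server: string,
  tool: string,
  args: Record<string, unknown>
): Promise<unknown> {
  // The sandbox never contacts MCP servers directly; the harness
  // authenticates, enforces policy, and routes the request.
  const response = await fetch("http://harness.internal/mcp/call", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ server, tool, args }),
  });
  if (!response.ok) {
    throw new Error(`Harness rejected ${server}.${tool}: ${response.status}`);
  }
  // The full result stays in the execution environment; only what the
  // generated code chooses to print reaches the model.
  return response.json();
}

Centralizing every call in one helper is what makes the governance story tractable: policy, audit logging, and credential handling live in the harness, not in model-generated code.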

                      My Thoughts

                      Code execution marks a maturation of the agentic ecosystem. It replaces brittle prompt-based tool usage with a robust, programmatic interface that scales with both data size and workflow complexity.

                      For MCP server developers, this shift removes the pressure to bundle functionality into a few high-level tools. With progressive disclosure, servers can expose fine-grained APIs without overwhelming the model context. Composition becomes the responsibility of the agent, not the server author.

                      For platform and enterprise architects, the value of MCP becomes clearer rather than diminished. MCP provides standardized schemas, authentication, and governance that raw, model-written SDK calls lack. While LLM-generated code introduces bounded non-determinism, safeguards such as type checking, policy enforcement, and sandboxing keep that risk manageable. The gains in capability, efficiency, and scalability outweigh the residual uncertainty. This architecture anticipates continual model improvement and is designed for long-term evolution rather than short-term prompt tuning.

                      Acknowledgements

This article is based on Adam Jones's talk, Code Execution with MCP: Fix Tool Token Bloat [1], presented at the MCP Developers Summit. Appreciation is also extended to the broader MCP and agentic AI community for advancing open standards and shared architectural patterns.

References

[1] Adam Jones, "Code Execution with MCP: Fix Tool Token Bloat," MCP Developers Summit.

                      Written by Om-Shree-0709 (@Om-Shree-0709)