# Model Context Protocol (MCP) Overview

The **Model Context Protocol (MCP)** is an open standard designed to seamlessly connect AI models with external data sources and tools ([Introducing the Model Context Protocol \ Anthropic](https://www.anthropic.com/news/model-context-protocol#:~:text=The%20Model%20Context%20Protocol%20is,that%20connect%20to%20these%20servers)) ([Inside Anthropic’s Model Context Protocol (MCP)to Connect AI Assistants to Data | by Jesus Rodriguez | Towards AI](https://pub.towardsai.net/inside-anthropics-model-context-protocol-mcp-to-connect-ai-assistants-to-data-529fb322ef5b#:~:text=The%20rapid%20advancement%20of%20AI,standard%20for%20seamless%20data%20integration)). In essence, MCP acts like a specialized API layer for LLMs, standardizing how they fetch context or perform actions. Instead of hard-coding integrations for each data source or function, MCP provides a universal protocol. This allows AI assistants (the *MCP clients/hosts*) to interface with any *MCP server* that exposes data or functionalities, without custom glue code ([Introducing the Model Context Protocol \ Anthropic](https://www.anthropic.com/news/model-context-protocol#:~:text=MCP%20addresses%20this%20challenge,to%20the%20data%20they%20need)) ([Anthropic's Model Context Protocol (MCP) is way bigger than most people think : r/ClaudeAI](https://www.reddit.com/r/ClaudeAI/comments/1gzv8b9/anthropics_model_context_protocol_mcp_is_way/#:~:text=1)). The goal is to **eliminate fragmented integrations** – traditionally, adding a new dataset or tool required bespoke code, but MCP replaces that with a single consistent interface ([Introducing the Model Context Protocol \ Anthropic](https://www.anthropic.com/news/model-context-protocol#:~:text=MCP%20addresses%20this%20challenge,to%20the%20data%20they%20need)). This yields more reliable and scalable AI systems by maintaining context across diverse tools and data sources ([Anthropic's Model Context Protocol (MCP) is way bigger than most people think : r/ClaudeAI](https://www.reddit.com/r/ClaudeAI/comments/1gzv8b9/anthropics_model_context_protocol_mcp_is_way/#:~:text=4)).

**MCP vs. Traditional Context Management:** In a typical Python application, “context management” often refers to resource handling using `with` statements or manually passing state (e.g. passing a conversation history list to each API call). By contrast, MCP formalizes context as a first-class concept: it separates the LLM’s reasoning from the retrieval of external context. Rather than embedding data directly into prompts or managing files in-line, an MCP-compliant system has the model request what it needs via the protocol. This separation of concerns means the AI-focused code (client/host) remains clean, while data access is handled by dedicated server modules ([Inside Anthropic’s Model Context Protocol (MCP)to Connect AI Assistants to Data | by Jesus Rodriguez | Towards AI](https://pub.towardsai.net/inside-anthropics-model-context-protocol-mcp-to-connect-ai-assistants-to-data-529fb322ef5b#:~:text=This%20architecture%20ensures%20a%20clear,handle%20data%20access%20and%20management)).
In practice, MCP clients focus on the conversation logic and user interaction, and MCP servers handle data retrieval or computations – a clear division that improves modularity ([Inside Anthropic’s Model Context Protocol (MCP)to Connect AI Assistants to Data | by Jesus Rodriguez | Towards AI](https://pub.towardsai.net/inside-anthropics-model-context-protocol-mcp-to-connect-ai-assistants-to-data-529fb322ef5b#:~:text=This%20architecture%20ensures%20a%20clear,handle%20data%20access%20and%20management)). Traditional Python context managers (e.g. file I/O) are still used under the hood in MCP servers to manage resources, but the *model’s* “context” (its knowledge of conversation and external info) is maintained through the MCP interface rather than global variables or long prompts.

**Key MCP Components:** MCP defines three core interfaces for interactions ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=The%20MCP%20protocol%20defines%20three,primitives%20that%20servers%20can%20implement)): **Tools**, **Resources**, and **Prompts**.

- **Tools** – Functions or actions that the LLM can invoke to perform operations (analogous to API endpoints that cause an action or computation) ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=Resources%20Application,actions%20API%20calls%2C%20data%20updates)). For example, a “search_database” tool or a “transcribe_audio” tool. Tools are *model-controlled*: the AI decides when to call them as part of its reasoning ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=Resources%20Application,actions%20API%20calls%2C%20data%20updates)). They typically take arguments and return results.
- **Resources** – Read-only data sources that can be retrieved to give the model more information (analogous to GET endpoints) ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=Prompts%20User,application%20File%20contents%2C%20API%20responses)). They are identified by URIs (like files, database queries, etc.) and provide content into the model’s context. For instance, a resource might be `file://logs/app.log` or `db://users/123` that the model can read ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=,postgres%3A%2F%2Fdatabase%2Fusers)). Resources are *application-controlled*, meaning the client or developer decides what data is exposed and how ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=Prompts%20User,application%20File%20contents%2C%20API%20responses)).
- **Prompts** – Predefined prompt templates or conversational workflows that can be triggered (similar to reusable query or command templates) ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=,to%20create%20standardized%20commit%20messages)). These help standardize interactions.
  For example, a server might offer a prompt template for “summarize recent audio” which the user or system can invoke without crafting the full prompt each time ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=,to%20create%20standardized%20commit%20messages)). Prompts are often *user-controlled* or system-controlled as shortcuts, rather than invoked by the model autonomously ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=Primitive%20Control%20Description%20Example%20Use,data%20managed%20by%20the%20client)).

**MCP Hosts and Servers:** In MCP architecture, an **MCP host (client)** is typically the AI assistant or application (like Claude Desktop or a custom GPT-4 app) that connects to one or more **MCP servers** ([Introducing the Model Context Protocol \ Anthropic](https://www.anthropic.com/news/model-context-protocol#:~:text=The%20Model%20Context%20Protocol%20is,that%20connect%20to%20these%20servers)) ([Inside Anthropic’s Model Context Protocol (MCP)to Connect AI Assistants to Data | by Jesus Rodriguez | Towards AI](https://pub.towardsai.net/inside-anthropics-model-context-protocol-mcp-to-connect-ai-assistants-to-data-529fb322ef5b#:~:text=The%20MCP%20architecture%20is%20based,server%20model)). The host (client) is responsible for the core LLM interaction – it generates model responses and orchestrates calls to servers when needed. The servers advertise their capabilities (which tools, resources, prompts they provide) during initialization, so the host can discover them and include them in the model’s context (often by describing them in the system prompt or via function-calling interfaces) ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=MCP%20consists%20of%20two%20components%3A,but%20still%20supports%20remote%20APIs)) ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=,to%20create%20standardized%20commit%20messages)). This arrangement allows a single AI agent to leverage multiple tools or data sources concurrently ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=MCP%20consists%20of%20two%20components%3A,but%20still%20supports%20remote%20APIs)). For example, one MCP server might expose a Whisper transcription tool, and another might expose a database query tool; the AI can call either as needed. This is more powerful than a monolithic context – **MCP maintains a unified context across diverse modalities and sources**, enabling more “agentic” AI behavior where the model can autonomously decide to fetch data or execute an action ([Anthropic's Model Context Protocol (MCP) is way bigger than most people think : r/ClaudeAI](https://www.reddit.com/r/ClaudeAI/comments/1gzv8b9/anthropics_model_context_protocol_mcp_is_way/#:~:text=4)).

**Reference Implementations:** Anthropic has open-sourced a suite of MCP resources, including an official **Python SDK (the `mcp` package)**, plus SDKs in TypeScript, Kotlin, Java, and more ([Model Context Protocol · GitHub](https://github.com/modelcontextprotocol#:~:text=%2A%20python)) ([Model Context Protocol · GitHub](https://github.com/modelcontextprotocol#:~:text=%2A%20java)). They have also published several *reference MCP servers* for common use cases – e.g.
servers for file systems, Git/GitHub, Google Drive, databases (PostgreSQL, SQLite), web browsing (Puppeteer), and even a **Memory** server for persistent knowledge storage ([GitHub - modelcontextprotocol/servers: Model Context Protocol Servers](https://github.com/modelcontextprotocol/servers#:~:text=%2A%20Everything%20%20,based%20persistent%20memory%20system)). These serve as examples of how to implement MCP servers that interface with real systems. For instance, the **SQLite MCP server** allows an AI to run SQL queries on a local database by exposing read/write query tools ([Introducing the Model Context Protocol](https://simonwillison.net/2024/Nov/25/model-context-protocol/#:~:text=far%2C%20some%20using%20the%20Typesscript,mcp%20on%20PyPI)). The existence of these libraries means we can integrate MCP into a Python codebase using high-level APIs rather than writing the protocol from scratch. The **Python SDK** provides classes for creating servers and clients, handling message formatting, and managing connections (over STDIO or other transports) ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=The%20Model%20Context%20Protocol%20allows,specification%2C%20making%20it%20easy%20to)) ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=The%20MCP%20protocol%20defines%20three,primitives%20that%20servers%20can%20implement)). Additionally, a higher-level helper library called **FastMCP** simplifies server creation by using decorators and type hints to define tools and resources in Python code ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=,return%20a%20%2B%20b)) ([GitHub - jlowin/fastmcp: The fast, Pythonic way to build Model Context Protocol servers ](https://github.com/jlowin/fastmcp#:~:text=The%20Model%20Context%20Protocol%20,MCP%20servers%20can)).
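As a taste of that decorator style, the sketch below shows how a resource and a prompt might be declared alongside tools. It follows the decorator patterns the FastMCP documentation describes, but the server name, URI template, and function bodies are illustrative, not taken from any real project:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("AudioGPT")  # hypothetical server name

# A read-only resource, addressed by a URI template
@mcp.resource("transcripts://{session_id}")
def get_transcript(session_id: str) -> str:
    """Return the stored transcript for a session."""
    # A real server would read this from disk or a database
    return f"(transcript for session {session_id})"

# A reusable prompt template the client can invoke by name
@mcp.prompt()
def summarize_transcript(transcript: str) -> str:
    """Build a prompt asking the model to summarize a transcript."""
    return f"Please summarize the following transcript:\n\n{transcript}"
```

Clients discover and read these the same way they list tools, so the server’s full surface is self-describing.
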
## Core MCP Concepts and Implementation (Python)

Implementing MCP in Python revolves around using context managers, defining tool/resource handlers, managing state, and ensuring memory is handled properly. The **MCP Python SDK** was built with modern Python features (asyncio, type hints, context managers) to make integration as Pythonic as possible. Below, we break down the core patterns.

### Context Managers and Lifecycle

MCP uses context managers to handle the lifecycle of connections and servers, ensuring proper setup/teardown of resources. For example, to connect an MCP client to a server (which could be a separate process), you might use an async context manager provided by the SDK. The snippet below illustrates a client connecting to a local server via standard I/O:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch a local MCP server process (e.g., running example_server.py)
server_params = StdioServerParameters(command="python", args=["example_server.py"])

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", tools)

asyncio.run(main())
```

In this example, `stdio_client(...)` manages launching the server and yielding its read/write streams, and `ClientSession(...)` manages the protocol handshake and messaging ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=async%20def%20run,initialize)). Both use `async with` to ensure that when the code block exits, the connection is closed and subprocesses are cleaned up. This is analogous to using `with open(file)` for files – it’s **resource management for the model’s context channels**. The SDK’s design with context managers prevents resource leaks like dangling processes or file descriptors.

On the server side, context managers can manage long-lived resources needed by tools. The Python SDK (especially via FastMCP) allows an **application lifespan** to be defined. For instance, you can use an `@asynccontextmanager` to open resources (like a database connection, a large ML model, etc.) at server startup and ensure they are closed at shutdown ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=%40asynccontextmanager%20async%20def%20app_lifespan,disconnect)). This pattern is crucial for **memory management**: it avoids re-initializing heavy resources on every request and provides a hook to free them. For example:

```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager
from dataclasses import dataclass

from mcp.server.fastmcp import FastMCP

@dataclass
class AppContext:
    whisper_model: "Whisper"  # placeholder type for a loaded Whisper model

# Initialize a Whisper model once on startup, release on shutdown
@asynccontextmanager
async def app_lifespan(server: FastMCP) -> AsyncIterator[AppContext]:
    model = Whisper.load_model("large")  # (Pseudo-code: load a Whisper model)
    try:
        yield AppContext(whisper_model=model)
    finally:
        model.close()  # free resources (if needed)

mcp = FastMCP("MyApp", lifespan=app_lifespan)
```

In this pseudo-code, the server gains a `whisper_model` in its context that all tool handlers can use without re-loading the model each time. The `yield` in the context manager provides the state to the server, and the `finally` ensures cleanup ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=%40asynccontextmanager%20async%20def%20app_lifespan,Cleanup%20on%20shutdown)). **This use of context managers in MCP mirrors traditional Python patterns** (setup/teardown), but now applied to AI model tools and data streams.

### Tools and Model Handlers (Functions as MCP Tools)

**Defining Tools:** In Python, MCP *tools* are typically implemented as simple functions with type-annotated signatures, registered with the server. Using FastMCP, one can declare a tool with a decorator. For example, a basic calculator tool could be:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Demo")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers"""
    return a + b
```

This exposes an “add” tool to the AI, taking two integers and returning an integer ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=,return%20a%20%2B%20b)). The SDK uses the function’s name, docstring, and type hints to advertise the tool’s interface to the model. Type hints allow MCP to automatically validate and serialize arguments – e.g., if the model calls `add` with a non-integer, the SDK can flag an error. The *difference from a normal Python function* is that the call is coming from the AI’s side (via JSON messages) rather than another Python caller. However, the development experience is similar to writing any Python function, thanks to the decorator abstraction.

**Model Handlers (Sampling Callbacks):** While servers handle tools and resources, the *client/host* side must handle the AI model’s outputs. If you are integrating GPT-4 (or GPT-4o) as the model, your client code acts as the “brain” of the assistant. The **Python SDK provides hooks for model output generation**.
For example, `ClientSession` accepts a `sampling_callback` – a coroutine that gets invoked when the model needs to produce a response ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=,)) ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=async%20with%20stdio_client,initialize)). In a minimal setup, this callback can simply return a dummy message or an empty result. But in a real integration, you would implement this callback to call the actual LLM API (e.g., OpenAI’s GPT-4) to produce an assistant message. For instance, suppose the AI needs to respond (perhaps after using some tools). We can use OpenAI’s API within the sampling callback:

```python
import openai
import mcp.types as types

async def handle_sampling_message(
    request: types.CreateMessageRequestParams,
) -> types.CreateMessageResult:
    # request contains the conversation so far and maybe instructions for the model.
    # Repackage the MCP messages for the OpenAI ChatCompletion API.
    openai_messages = [
        {"role": msg.role, "content": msg.content.text} for msg in request.messages
    ]
    completion = openai.ChatCompletion.create(model="gpt-4", messages=openai_messages)
    assistant_text = completion["choices"][0]["message"]["content"]
    return types.CreateMessageResult(
        role="assistant",
        content=types.TextContent(type="text", text=assistant_text),
        model="gpt-4",
        stopReason="stop",
    )
```

This pseudo-code shows how the callback might translate an MCP message request into an OpenAI API call. The `request.messages` would carry the conversation including any tool results injected by the protocol. We repackage them for OpenAI, get GPT-4’s answer, and wrap it as `CreateMessageResult` to send back to the MCP session. Essentially, **the MCP client is acting like a middleman between GPT-4 and the servers**. This is how you integrate a model that doesn’t natively “speak” MCP (like GPT-4 via the OpenAI API) into the MCP workflow. The model’s replies and any tool calls it decides to make would go through this orchestration. (By contrast, if one were using Anthropic’s Claude as the model, Claude Desktop handles this internally; with GPT-4, we implement it ourselves.)

**State Serialization:** A crucial aspect of any context management is how state is preserved or passed along. In MCP, “state” can refer to a few things: the conversation state (history of messages), the server-side state (any persistent info the server keeps between calls), or tool-specific state (like a cursor in a database). MCP encourages stateless request handling where possible – each tool call is independent – but it also provides mechanisms for stateful behavior. For example, servers can maintain in-memory state in global variables or via the lifespan context (like keeping a DB connection alive). If a tool needs to maintain state across calls (say, a tool that opens an audio stream and processes chunks sequentially), the server implementation would need to handle that (perhaps returning a session ID or storing context in a global dict). MCP doesn’t automatically serialize complex Python objects; you as the developer must decide what context to carry over. Often, simple approaches work: for instance, the server could use a **resource** to represent state.
An audio processing server might create a resource like `memory://partial_transcript` that accumulates results, which the AI can read incrementally – effectively serializing state into a resource string or file.

When it comes to conversation state **serialization on the client side**, that is typically handled by storing the message history in a Python list (or similar) and sending it with each model API call. This is the same approach as a non-MCP chatbot: you keep a list of messages (user and assistant) and append to it each turn ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=2,API%20request%20to%20provide%20context)) ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history)). If needed, you may **persist this list to disk** (e.g., save to a JSON file) so that a conversation can be restored later or resumed if the program restarts. That would be manual but straightforward serialization of state. In an MCP context, since the protocol itself doesn’t persist session state between runs, implementing a save/load of conversation (and any relevant server state, if applicable) is up to the integrator. A practical pattern is to dump the conversation history to a file or database after each turn, and reload it on startup for continuity (particularly useful for long-running voice assistant sessions).

### Memory Management in MCP

Memory management has two facets here: **managing memory in Python for audio/model data**, and **managing the LLM’s context window (which is a form of memory for the model)**.

For Python memory, we already touched on how context managers prevent leaks (closing files, cleaning subprocesses). You should also be mindful of large objects like audio buffers. For example, loading a long audio file via pydub’s `AudioSegment` will consume a lot of RAM (uncompressed PCM data). If your use case might handle lengthy audio, consider processing it in chunks or streaming it to avoid high RAM usage. Pydub itself can load partial segments, but a common strategy is to break the audio into smaller pieces for transcription ([path - How to successfully transcribe audio files using Whisper for OpenAI in Python? - Stack Overflow](https://stackoverflow.com/questions/76366387/how-to-successfully-transcribe-audio-files-using-whisper-for-openai-in-python#:~:text=They%20also%20recommend%20using%20AudioSegment,the%20audio%20file%20in%20pieces)). This not only saves memory but also keeps each Whisper API call or model inference within size limits. (The OpenAI Whisper API, for instance, only accepts files up to ~25 MB ([Making transcriptions using OpenAI's Whisper - Tilburg Science Hub](https://tilburgsciencehub.com/topics/automation/ai/transcription/whisper/#:~:text=locally,mpga%2C%20m4a%2C%20wav%20or%20webm)), so chunking is required for longer recordings.) You can use `AudioSegment` slicing or ffmpeg to split audio, transcribe each chunk, and then combine transcripts – perhaps summarizing along the way for efficiency. These techniques ensure you don’t need to hold the entire audio in memory at once.

On the **LLM context memory** side, remember that GPT-4 and similar models have a limited context window (e.g., 8K or 32K tokens). As you accumulate conversation history and transcribed text, you must be careful not to exceed this window. Best practices include **limiting the stored history** or summarizing older parts ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=,Retaining)). For example, you might keep only the last N interactions in full detail and either drop or compress earlier content. This can be done by occasionally calling GPT-4 to summarize older messages and replacing them in the history with a concise summary (thus freeing up space for new interactions). OpenAI’s documentation and community suggest truncating or summarizing to manage token limits ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=,stored%20in%20the%20conversation%20history)). Since your project uses Python 3.11, you can also leverage efficient data structures (like `deque` for a rolling window of messages). In MCP’s terms, one could even imagine an MCP **Memory server** that stores conversation logs or vector embeddings of past dialogue and retrieves relevant pieces on demand – indeed, the MCP ecosystem provides a “Memory” server that implements a knowledge graph for persistent memory across sessions ([GitHub - modelcontextprotocol/servers: Model Context Protocol Servers](https://github.com/modelcontextprotocol/servers#:~:text=%2A%20Memory%20,io)). In a simpler setup, though, a Python list and prudent history management suffice.
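To make the rolling-window idea concrete, here is a minimal sketch; the window size, helper names, and OpenAI-style message dicts are illustrative assumptions rather than anything prescribed by MCP:

```python
from collections import deque

MAX_TURNS = 20  # illustrative window size

# Rolling window: the oldest messages fall off automatically once the deque is full
history: deque[dict] = deque(maxlen=MAX_TURNS)

def add_message(role: str, content: str) -> None:
    history.append({"role": role, "content": content})

def build_messages(system_prompt: str) -> list[dict]:
    # A fancier version could summarize the dropped messages and prepend
    # that summary here instead of silently forgetting them.
    return [{"role": "system", "content": system_prompt}, *history]
```
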
Finally, cleaning up resources is important. If your audio processing writes a temporary WAV file (for Whisper input after converting formats), ensure you delete it after use. If you open file handles, close them promptly (using `with open(...)` or `.close()` in `finally`). If you allocate large objects (like a loaded ML model) and you’re done with them, you might let them be garbage-collected or explicitly `del` them and call `gc.collect()` in tight loops to avoid bloating memory. The Python SDK’s lifecycle hooks (like the `app_lifespan` above) give a convenient place to do such cleanup at the end of a session or when shutting down an MCP server ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=%40asynccontextmanager%20async%20def%20app_lifespan,Cleanup%20on%20shutdown)).

## Integrating MCP with Audio Processing

Integrating MCP into an audio transcription pipeline (with Whisper) and a GPT-4o interaction loop involves treating audio handling as a tool/resource that the model can invoke or use, and preserving context between the audio and text domains. The objective is to allow a seamless flow where an audio input is processed, turned into text, fed into the LLM, and the LLM’s response (text or even audio) is managed – all while following MCP’s structure for consistency and maintainability.

### Efficient Audio Stream Handling

**Audio as Input:** In a traditional setup, if a user provides an audio file, your Python code might directly call Whisper to get a transcript, then send that transcript to GPT. With MCP, we can formalize this into a tool – say, `transcribe_audio`. We would implement an MCP tool function that takes an audio file (or bytes) as input and returns text. The question is how to pass the audio data into that tool call. Since MCP tool arguments are typically JSON-serializable, sending raw binary isn’t straightforward. A common solution is to use a **resource** to handle large data.
For example, the client (or user) could first register the audio file as a resource that the server can access (perhaps by placing it in a known directory and referring to it as `file://path/to/audio.wav`). Then the tool can take the URI or file name as a parameter, load the file from disk, and process it. In practice, for a CLI-based flow, you might not need the AI itself to initiate transcription – you could have the application logic detect an audio input and call the tool directly. But in an *agentic* scenario, the model could decide to call `transcribe_audio` when it encounters an audio resource. For instance, if using GPT-4o which supports voice, the model (or the system prompt) might indicate an audio file needs transcribing, triggering the tool.

**Implementing the Transcription Tool:** Using the OpenAI Whisper API via the `openai` Python library is straightforward. Here’s how we can implement a tool in the MCP server for transcription, using pydub for format handling:

```python
import os

import openai
from pydub import AudioSegment

@mcp.tool()
def transcribe_audio(file_path: str) -> str:
    """Transcribe an audio file to text."""
    # Use pydub to ensure consistent format (e.g., convert to wav mono)
    audio = AudioSegment.from_file(file_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    temp_wav = "/tmp/temp.wav"
    audio.export(temp_wav, format="wav")
    try:
        # Call OpenAI Whisper API to transcribe
        with open(temp_wav, "rb") as f:
            result = openai.Audio.transcribe("whisper-1", f)
        text = result.get("text", "")
    finally:
        os.remove(temp_wav)  # clean up the temporary file
    return text
```

This tool expects a file path string. The steps are: load the file with **pydub** (which supports various formats like mp3, mp4, etc.), normalize it to a WAV with the desired sample rate (Whisper often works best at 16kHz mono), export it to a temp file, then call `openai.Audio.transcribe` on that file object. Finally, it returns the transcribed text. The OpenAI API will return a JSON including the transcription under `"text"` ([Making transcriptions using OpenAI's Whisper - Tilburg Science Hub](https://tilburgsciencehub.com/topics/automation/ai/transcription/whisper/#:~:text=)). We extract it and send it back. All of this happens inside the tool function, so from the AI’s perspective, calling `transcribe_audio("user_upload.mp3")` yields a text string result.

A few notes on efficiency: if performance is a concern or if you plan to transcribe very large files, you could enhance this function by splitting audio into chunks (using `AudioSegment[:segment_duration]` in a loop) to avoid hitting size limits, as suggested by others ([path - How to successfully transcribe audio files using Whisper for OpenAI in Python? - Stack Overflow](https://stackoverflow.com/questions/76366387/how-to-successfully-transcribe-audio-files-using-whisper-for-openai-in-python#:~:text=They%20also%20recommend%20using%20AudioSegment,the%20audio%20file%20in%20pieces)). Each chunk could be sent to Whisper and partial transcripts concatenated. Also, the above uses the OpenAI cloud Whisper (“whisper-1” model) which offloads the heavy computation to OpenAI’s servers. If instead you wanted to use local Whisper (via `openai-whisper` or `whisper.cpp`), you might load a model in memory during server startup (in lifespan) and run it inside the tool – but that requires a lot of resources (and possibly GPU) for large models.
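If you did go the local route, the sketch below shows one way to wire it up; it assumes the `openai-whisper` package’s `whisper.load_model`/`model.transcribe` API and the FastMCP lifespan/`Context` pattern shown earlier, with illustrative names throughout:

```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

import whisper  # the openai-whisper package; heavy, may want a GPU

from mcp.server.fastmcp import Context, FastMCP

@asynccontextmanager
async def app_lifespan(server: FastMCP) -> AsyncIterator[dict]:
    # Load the model once at startup rather than on every call
    model = whisper.load_model("base")
    yield {"whisper_model": model}

mcp = FastMCP("LocalWhisper", lifespan=app_lifespan)

@mcp.tool()
def transcribe_local(file_path: str, ctx: Context) -> str:
    """Transcribe an audio file with the locally loaded Whisper model."""
    model = ctx.request_context.lifespan_context["whisper_model"]
    return model.transcribe(file_path)["text"]
```
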
Using the API is simpler if an API key is available, and with pydub you can compress to reduce file size (OpenAI Whisper supports MP3 input too; you could skip conversion and send the MP3 bytes directly – the code above converts to WAV mostly to ensure 16k mono, but OpenAI might do that internally if given other formats).

**Audio Stream vs File:** If your application ever needs *real-time streaming* (e.g., live microphone input), MCP can handle it but in a slightly different way. Instead of a single `transcribe_audio` call, you might have a loop that sends chunks of audio as they are recorded. This could be modeled as a tool that takes a chunk and appends to a resource (like a rolling transcript resource), or the client could accumulate audio and call the tool periodically. For initial integration, it’s likely simpler to handle entire files per turn, as in the CLI scenario (the user speaks or provides a file, you transcribe fully, then respond). Ensure the tool execution is efficient to not stall the conversation too long; if needed, spawn the transcription in a background thread and stream a “typing…” message to the user.

**Preserving Audio Quality and Efficiency:** Using pydub introduces some overhead (decoding and re-encoding audio). You might streamline by directly invoking ffmpeg via subprocess for conversion, or by using `openai.Audio.transcribe` on the original file format. In fact, OpenAI’s documentation indicates you can pass an MP3 file object and it will handle it ([Making transcriptions using OpenAI's Whisper - Tilburg Science Hub](https://tilburgsciencehub.com/topics/automation/ai/transcription/whisper/#:~:text=Hub%20tilburgsciencehub,a%20json%20file%20of)). For example:

```python
with open(file_path, "rb") as f:
    result = openai.Audio.transcribe("whisper-1", f)
```

This likely would work for common formats and might eliminate the need for the `AudioSegment.export` step, as long as the audio is under size limits and correctly encoded. Pydub remains useful for chunking or if you need to preprocess (noise reduction, etc.).

### Context Preservation Between Processing Steps

One of the challenges in a multi-step pipeline (audio -> text -> GPT -> maybe text-to-speech) is preserving the **user’s intent and context** across each step. With MCP, part of this is handled by the conversation paradigm: each turn’s output becomes part of the context for the next turn. Let’s break down how state and context can be threaded through:

- **Audio Conversion State:** If you perform an audio format conversion or compression (say, create that `temp.wav`), that state (the temp file path, etc.) doesn’t need to persist beyond the transcription call. It’s an internal detail of the tool. After `transcribe_audio` returns, we can delete the temp file and forget about it. Thus, there’s no long-term context to preserve at the application level for audio conversion – just ensure cleanup to not clutter disk. Using a context manager for the temp file or a `try/finally` in the tool can ensure deletion.
- **Transcription Result State:** The transcript is the key piece of state that must carry forward. Once the tool returns the text, the **MCP client (AI)** will include that text in the conversation. In an agent scenario, the model might get the text and then continue its response. If you are orchestrating manually, you will take the transcribed text and append it as the **user’s message** in the conversation history list.
  For example: `conversation_history.append({"role": "user", "content": transcript_text})` ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history)). By doing so, you preserve the user’s spoken input in textual form as part of the context for GPT-4. This is crucial – GPT-4 can’t “remember” the audio itself; it needs the transcription to know what the user said. So effectively, the output of the `transcribe_audio` tool becomes a new message in the chat history. In an interactive MCP host, this could be automated (Claude, for instance, would automatically incorporate tool outputs into the chat context). In your custom integration, you’ll do it in code: treat the transcribed text as if the user had typed that message.
- **Model Response Generation:** Now GPT-4 (or GPT-4o) generates a response based on the conversation (which now includes the user’s query in text). That response is obtained via the OpenAI API call (as shown in the sampling callback, or even simpler: `openai.ChatCompletion.create(model="gpt-4", messages=conversation_history)`). You append the assistant’s reply to the history as well ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history)). At this point, you have maintained a coherent conversation state that includes the audio-derived message. The **context has been preserved**: even though the user spoke, the system’s memory of it is a text message from the user.
- **Conversation History Management:** Over multiple turns, continue appending user (whether from audio or text input) and assistant messages to the `conversation_history`. If a new audio comes in the next turn, again transcribe it and append. This way, the dialogue builds naturally. As noted, remember to trim this history if it grows too large (perhaps keep the last ~50 messages or fewer, or use summarization) ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=,Retaining)). The MCP paradigm doesn’t change how you manage this list; it just provides a structured way to get the content for each user turn (via tools like transcription or via direct text input).
- **Error Handling and Continuity:** If an error occurs during transcription (say the audio file was corrupt or the Whisper API failed), design the system to handle it gracefully and still maintain context. For example, if `transcribe_audio` throws an exception, you could catch it and return an error message like `"[Error: transcription failed]"`. The AI can then see that and perhaps ask the user to repeat or provide another file. Importantly, you might choose *not* to add a failed transcription as a user message (since it wasn’t understood). Or you could add a system message indicating the error. The approach depends on how you want the conversation to flow. The MCP protocol itself would allow returning an error string or raising an MCP error result that the host could turn into a message. Implementing robust error returns in tools keeps the conversation from crashing.
  For instance, wrapping the transcription in a try/except and returning `"Error: "+str(e)` as the tool result ensures the AI gets a textual indication of failure rather than nothing ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=return%20f)).
- **Preserving Speaker Identity:** In a conversation, it’s clear which messages are the user’s and which are the assistant’s. For voice input, after transcription, you should still mark it as a **user** role message when appending to history. The assistant’s responses remain assistant role. MCP deals with roles explicitly in the message objects (user, assistant, system) ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history)). Maintaining correct roles ensures GPT-4 knows what is user input vs. its own previous output.
- **Direct GPT-4o Audio Input:** If using GPT-4o (the “omni” model that can handle audio natively ([Hello GPT-4o - OpenAI](https://openai.com/index/hello-gpt-4o/#:~:text=Hello%20GPT,and%20text%20in%20real%20time))), the workflow may differ slightly. GPT-4o allows sending an audio file directly as part of the prompt (similar to how GPT-4 vision allows an image input). If the OpenAI API supports this (e.g., by specifying the audio in the messages or via a special parameter), you could bypass explicit Whisper calls. However, under the hood OpenAI is likely performing transcription anyway – the difference is they do it server-side. From an integration perspective, you might send the audio and receive a text answer directly. **One must consider how to log the conversation** in this case: presumably, GPT-4o will internally convert the audio to text (they mention it “reasons across … audio” ([Hello GPT-4o - OpenAI](https://openai.com/index/hello-gpt-4o/#:~:text=Hello%20GPT,and%20text%20in%20real%20time))). It’s wise to capture what the model interpreted. If the API returns the text it transcribed (some APIs might, or you might have to prompt GPT-4o to repeat the query in text), you can then insert that into your history for completeness. If not, you might transcribe simultaneously on your side just for the record. In any event, for consistency and debugging, having the transcript is useful. The rest of the conversation handling remains the same. The benefit of GPT-4o is potentially reduced latency (one less API call) and leveraging OpenAI’s integrated speech recognition tuned for GPT. But the downside is less control – you can’t easily fix a transcription error if you don’t have the intermediate text. With your own Whisper step, you could apply custom correction or ask the user for clarification if needed.
- **Multi-modal Responses:** GPT-4o can also generate audio outputs (by synthesizing voice) as indicated by OpenAI’s notes ([OpenAI Chat :: Spring AI Reference](https://docs.spring.io/spring-ai/reference/api/chat/openai-chat.html#:~:text=The%20gpt,use%3A%20text%20%2C%20audio)). If you plan to support the assistant speaking back, you might treat that as another step – either using GPT-4o’s audio output directly or using a TTS engine on GPT’s text reply. MCP could facilitate this by having a *tool for TTS* (text-to-speech) if using an external TTS system.
  For example, after GPT generates text, you could call a `speak_text` tool that uses a library or API (like ElevenLabs or pyttsx3) to produce an audio file, and then return a path to that audio or play it to the user. This would make the system fully voice-interactive. The conversation state would now also include possibly a reference to what was spoken. (Though typically you don’t include the assistant’s spoken audio in context – the text is enough.)

In summary, **context preservation** in an audio-GPT workflow means ensuring that the result of each step (especially the transcription) is fed into the next step in a form the model can use (text), and that the conversation history faithfully records what has transpired. MCP helps by structuring the transcription as a callable tool, but you still manage the conversation list and call the model, much like a standard chat application ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=2,API%20request%20to%20provide%20context)) ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history)).

## Multi-Modal Context Management

Multi-modal here refers to handling both **audio and text modalities (and potentially others like images)** in one cohesive context. When integrating voice input/output with a text-based model like GPT-4, you are effectively bridging modalities: user’s speech -> text -> model -> text -> (speech). The challenges include synchronizing state between these different modes and recovering smoothly from errors in one modality without losing overall conversational context.

### Managing Context Across Audio and Text

To the GPT-4 model, everything ultimately appears as text (unless using a special multi-modal model). So the trick in multi-modal context management is to **ensure that each modality’s content is represented in the shared context**. We’ve done this for audio by transcribing it to text. If there were another modality (say images), you’d analogously describe or label them (or use OCR for text in images, etc.). The general principle is: convert all inputs to a format the model understands and maintain a unified memory of interactions.

In code, the conversation history data structure will contain entries for each turn, regardless of input type. You might add a prefix or note to indicate modalities, e.g., `{"role": "user", "content": "[Voice] How is the weather today?"}` or even store metadata like `{"role": "user", "content": "How is the weather today?", "via": "audio"}`. The model doesn’t see the metadata (unless you include “[Voice]”), but your system might log it. For the model, it’s just text content from the user. If GPT-4o is used and it produces an audio answer, you likely also get or convert that answer to text to store as the assistant’s message (the user might hear the audio, but for the next turn, you need the text of what was said to feed back into the model’s context). Thus, even outputs from different modalities are converted to text for persistence. The **conversation state thus remains modality-agnostic** in structure, even though modalities were used in interaction.

**Synchronizing State:** If you have multiple components – say one component recording audio, another transcribing, another handling the chat logic – you need to sync their operation.
A common approach is to use an asynchronous loop or callbacks: for example, once audio recording is done (or on each chunk), trigger the transcription and wait for the result, then feed it to the chat. If these are separate threads or processes, use thread-safe queues or signals. The MCP server/client separation can actually help here: the audio recorder could be an MCP *client* pushing audio data to a transcription *server*, and the chat loop is another client that waits for the transcription result resource to appear. However, that might be over-complicating things. Simpler: do it sequentially in one thread (record -> transcribe -> chat -> TTS -> loop). The state to synchronize is essentially “the content of the last user turn” and “the conversation memory”. Ensure that when a new turn begins, you clear any temporary data from the last turn (like intermediate audio buffers), but keep the persistent context (the conversation history list).

One must also synchronize the **identity** of the conversation across modalities. If you allow, say, typed input *and* voice input from the same user interchangeably, all should go into the one history. This means your code paths for text input and audio input should both ultimately call a common function to process a user query that updates the same `conversation_history`. This avoids divergence where voice inputs and text inputs create separate threads of conversation. In a CLI, you might not have simultaneous modes, but in a GUI you might. It’s worth designing for consistency: e.g., if the user types something while an audio reply is playing, decide how to handle that (queue the input or interrupt the output). These are UI/UX specifics beyond MCP, but they impact context management because overlapping turns can confuse the state.

### Error Recovery and Context Restoration

In any multi-step pipeline, errors can occur at various points: microphone failure, file read error, transcription misrecognition, API error from GPT, etc. **Error recovery** means handling the error and continuing the interaction without starting from scratch. The MCP approach encourages modular handling: if a tool (like transcription) errors, the server can return an error message or code, and the host (AI) can react. For example, the transcription tool could return `"[Transcription failed]"` which you then treat as the user’s input (perhaps leading the AI to say “I’m sorry, I didn’t catch that. Could you repeat?”). This way, the *conversation context still moves forward* – albeit with an error indication – rather than breaking. Another approach is to not add anything to history for that turn and simply prompt the user again externally. But it’s often useful to inform the model of what happened, especially if the model will decide how to respond. Including an error note as a system message or as part of the user message might lead the model to politely ask for a retry. In MCP, the standard way to indicate tool failure is to throw an exception in the tool server, which the protocol will capture and send to the client as an error response. Your client can catch that and decide to convert it into a user-visible message.

**Context restoration** refers to the ability to pick up a conversation after a disruption or over multiple sessions. Suppose the program crashes or is stopped in the middle of a conversation. If you saved the `conversation_history` to disk (e.g., as JSON) after each exchange, you can reload it on restart.
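A minimal sketch of that save/reload step (the session file name and helper names are hypothetical):

```python
import json
from pathlib import Path

SESSION_FILE = Path("session1.json")  # hypothetical session file

def save_history(conversation_history: list[dict]) -> None:
    SESSION_FILE.write_text(json.dumps(conversation_history, ensure_ascii=False, indent=2))

def load_history() -> list[dict]:
    # Returns an empty history when no previous session exists
    if SESSION_FILE.exists():
        return json.loads(SESSION_FILE.read_text())
    return []
```
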
Then, you can instantiate the conversation with GPT-4 by replaying those messages to it (perhaps summarizing if too long). In practice, GPT models don’t have an actual “session memory” – you always send the history each time – so restoration is just reloading and continuing the list. If using Anthropic’s Claude via Claude Desktop (which is not your case, but for comparison), the Claude app manages persistent conversations for you. In your CLI, you might implement a command like `--resume session1.json` to load past context. Just be mindful of the context window limits if the conversation was long; you might need to truncate old parts or summarize them upon restoring.

From the server perspective, if any **long-lived state** was present (not likely in your audio/GPT scenario, but imagine a tool that logs data or has an internal memory of user preferences), you’d also want to persist that (maybe in a small file or database). For example, a “conversation memory” MCP server using a vector store would save embeddings to a file so that after a restart, it can reload and continue providing relevant info.

In summary, robust multi-modal context management means: every input is captured in the context in text form, the system can recover from failures by informing either the user or the model (without losing past context), and the entire state can be saved/restored if needed to maintain continuity across runs. Using MCP doesn’t automatically guarantee this, but its structured approach makes it easier to isolate where each piece of data flows and thus where to add error handling and persistence. For instance, with MCP you know the transcription happens in one place (the tool), so you can focus error handling for that step there.

## Performance Considerations

When integrating audio processing and large-language-model interactions, performance can be a concern in terms of speed, memory usage, and cost (if using API calls). We’ll examine some key performance aspects and how to optimize them in this context, while adhering to MCP patterns.

### Memory Efficiency in Audio Handling

Processing audio can be memory-intensive. For instance, loading a 5-minute stereo WAV file into memory can easily consume tens of MBs. If you convert that to raw PCM, it balloons further (since formats like MP3 are compressed). **To keep memory usage in check**:

- **Stream or chunk large files:** As mentioned, use pydub or ffmpeg to process in chunks instead of one giant blob. You can load a segment of an AudioSegment by slicing (AudioSegment supports slicing by milliseconds). For example, process 1-minute chunks sequentially. This way, you only hold a small buffer at a time.
- **Avoid unnecessary copies:** When possible, work on file handles or bytes streams instead of duplicating data. If you have an audio file on disk, you don’t necessarily need to read it fully into a Python bytes object just to pass to `openai.Audio.transcribe`; you can open a file object and pass it, which streams data under the hood. Similarly, if a tool outputs a large text (say, transcribing a long audio), ensure you don’t accidentally copy that text multiple times (just return it or stream it out).
- **Memory overhead of MCP:** The MCP protocol messages are JSON-based (or similar), but these are tiny compared to audio or model data. The overhead of keeping a list of tools and resources in memory is negligible. The main memory load comes from the actual data you handle (audio waveforms, model prompts, etc.). So focus optimization efforts there.
- **Use of async and concurrency:** Python’s `asyncio` can help interleave operations (e.g., transcribe chunk 2 while chunk 1’s result is being sent to GPT), but due to the GIL, pure Python won’t do two CPU-heavy tasks truly in parallel. If heavy CPU work (like local Whisper transcription) is happening, consider offloading to threads or subprocesses to utilize multiple cores. The Python SDK and FastMCP are async-friendly, so you can call tools concurrently if needed. Just be cautious with thread safety if your tool uses global models.

### Context Switching and Protocol Overhead

**Context switching** can refer to a few things: switching between the audio processing and model processing tasks, and switching between the MCP host and server communication. In terms of OS context-switch or thread-switch overhead, those are minimal relative to the tasks at hand. Transcribing audio or calling the GPT-4 API both have latencies on the order of seconds (for long audio or complex prompts), whereas a context switch is microseconds. So, using a couple of additional `async with` context managers or sending JSON over STDIO does not meaningfully slow things down. For example, the JSON-RPC over STDIO that Claude Desktop uses to talk to MCP servers is very fast and local ([Introducing the Model Context Protocol](https://simonwillison.net/2024/Nov/25/model-context-protocol/#:~:text=Their%20first%20working%20version%20of,standard%20input%20and%20standard%20output)). The bigger overhead might be if you run a separate process for the MCP server: launching a new process for every query would be expensive, but that’s not how MCP is intended to be used. You launch the server once and keep it running, so each tool call is just an inter-process message. **MCP’s design tries to minimize overhead** by keeping connections open and reusing them.

One area to watch is **MCP’s event loop latency**. If you have many asynchronous operations, make sure you await them properly. Also, if you were to use the HTTP (SSE) transport in the future (as the spec is adding ([Introducing the Model Context Protocol](https://simonwillison.net/2024/Nov/25/model-context-protocol/#:~:text=machine%20is%20the%20only%20way,to%20try%20this%20out))), network latency could be introduced. But for now, focusing on local usage with STDIO, the overhead per call is small (parsing a JSON and writing a JSON).

That said, every additional step does add some latency. For example, if the model needs some info, calling an MCP tool involves the model deciding to call it, the request being sent to the server, server processing (which could be heavy, like a DB query or audio crunching), and the response coming back. This can be slower than if you had inlined the operation in code. But the trade-off is flexibility. To mitigate this, you can **pre-fetch or cache results** (discussed below) to avoid repetitive tool calls. Also, design the system to do as few context switches as necessary. For instance, if you know you’ll need the transcript of audio, it might be faster to always transcribe immediately (one step) rather than, say, having the model ask for a resource in multiple turns. In your pipeline, you likely do it in one go (transcribe then feed to model), which is optimal. If instead you had the model first call a tool to get audio file info, then another to transcribe, that’s two round-trips; consolidating into one `transcribe_audio` call is better.
In summary, **the MCP mechanism itself has low overhead**, but orchestrating fewer steps where possible will improve user-perceived performance.

### Caching Strategies

**Caching** can dramatically improve performance if the same expensive operations are repeated. In an audio-to-text scenario, a clear candidate for caching is the transcription result. For example, if users might submit the same audio file multiple times (or if your system processes some known audio routinely), you could cache the transcript keyed by a hash of the audio file. Next time, skip the Whisper call and instantly return the cached text. This could be as simple as storing results in a dictionary or a small SQLite table. Given Whisper’s significant compute, caching its output for identical inputs is worthwhile. Just ensure you manage the cache size if needed (evict old items, etc.).

Similarly, if your conversation tends to repeat certain questions, caching **model responses** could be useful. However, caching GPT outputs is tricky since even slight differences in phrasing can lead to different answers, and you risk staleness if the context differs. A safer approach is to cache any expensive *tool* outputs (like search results for a given query, etc.). In your project, the primary heavy tool is transcription, which we addressed. Another cache: if using local models (Whisper or GPT), caching model **weights in memory** is critical (don’t reload them each time). That’s already handled by using the lifespan to load once. If using the OpenAI API, the “cache” is effectively OpenAI’s servers and your network – you can’t cache their inference, but you can reduce calls. One could also cache intermediate audio conversions. For example, if you always convert audio to `temp.wav`, you might skip conversion if you detect the file is already a correct WAV (or if you have previously converted this file and it’s unchanged).

At a higher architecture level, the **MCP servers could use caches** internally. The MCP ecosystem even includes a Redis-based server for caching and key-value storage ([GitHub - modelcontextprotocol/servers: Model Context Protocol Servers](https://github.com/modelcontextprotocol/servers#:~:text=match%20at%20L574%20,designed%20for%20interacting%20with%20the)). In an advanced setup, your transcription server could store past transcripts in a Redis cache (perhaps with the audio file path or a content hash as key). The tool function would first check the cache, and only call Whisper on a cache miss. This decouples caching logic from the main code and leverages a robust in-memory store. But for a single-user CLI, a simple Python dict or file cache is fine.
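A minimal sketch of such a cache (the cache file name is hypothetical, and `transcribe_audio` stands in for the transcription helper defined earlier or any plain function wrapping the Whisper call):

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("transcript_cache.json")  # hypothetical cache location
_cache: dict[str, str] = (
    json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
)

def cached_transcribe(file_path: str) -> str:
    # Key on the audio content, not the path, so renamed copies still hit
    key = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    if key not in _cache:
        _cache[key] = transcribe_audio(file_path)  # fall through to Whisper on a miss
        CACHE_FILE.write_text(json.dumps(_cache))
    return _cache[key]
```
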
After finishing a big transcription, you can delete the large objects (e.g., the AudioSegment and raw audio bytes) once they're no longer needed, to free memory promptly. Python will eventually free them, but it may wait if references remain or cyclic references exist. You generally don't need to call `gc.collect()` manually, but in long-running processes that handle many large objects in sequence (like transcribing hundreds of files), doing so occasionally to avoid memory bloat is not unreasonable.

**Parallel considerations:** If you run multiple tasks, be careful not to oversaturate the CPU. Transcribing audio and generating a GPT response in parallel might seem faster, but both can be CPU/GPU heavy (unless one is fully offloaded to an external service). If you use OpenAI's API for GPT, that call just waits on the network, so you *could* transcribe the next audio clip while waiting for GPT's answer to the last query – pipelining to improve throughput. That is worth doing if you expect back-to-back requests (e.g., a user queues several questions), but for interactive use it may complicate things with little gain, since you typically transcribe *then* ask GPT.

**Benchmarking:** Measure where the time goes in your pipeline. Log timestamps before and after each major step (audio loading, transcription, model call, etc.). You'll likely find that transcription (via the API) takes a few seconds for short audio and GPT-4 takes a couple of seconds to respond, depending on prompt length – these are the dominant factors. The overhead from glue code (the MCP call, etc.) will be a few milliseconds at most. Knowing this, optimizations like chunking audio or reducing prompt size yield bigger speedups than micro-optimizing Python loops. In terms of raw numbers: the large Whisper model transcribes at roughly 1× realtime on a high-end GPU (i.e., a 30-second clip in around 30 seconds), and the hosted API is probably faster since OpenAI can use optimized inference. GPT-4 typically has a latency of about 1–2 seconds for a short prompt, more if the response is long (because it streams tokens). So the user might wait around 3–5 seconds after speaking to get a reply – not instantaneous, but acceptable for many cases.

To improve perceived performance, one strategy is to **stream outputs**: show the partial transcription as it's being decoded (if your transcription library exposes partial results), or stream the GPT answer as it arrives token by token. Streaming can be implemented by reading the OpenAI event stream in Python (setting `stream=True` in the ChatCompletion call); a short sketch follows below. This keeps the user engaged and masks the true latency. MCP doesn't conflict with streaming – you can stream tool outputs to the model if the model supports it. The MCP spec has a notion of output streaming for tools (e.g., a tool yielding partial results), but handling that is more complex; streaming the final model answer to the UI is the simpler win.

Finally, keep your dependencies lean and updated. The `openai` library and `pydub` are already in use; keep them up to date for performance and bug fixes. Python 3.11 also offers interpreter speed improvements over 3.10 – a nice benefit for free.
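As a concrete illustration of the streaming idea above, here is a minimal sketch using the legacy `openai` client interface (`openai.ChatCompletion`, matching the style used elsewhere in this document). The function name and the printing behavior are choices of this example, not part of the project.

```python
import openai


def stream_reply(conversation_history: list[dict]) -> str:
    """Stream the assistant's answer token by token and return the full text."""
    pieces = []
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=conversation_history,
        stream=True,  # yields incremental chunks instead of one final message
    )
    for chunk in response:
        delta = chunk["choices"][0]["delta"]
        piece = delta.get("content", "")
        if piece:
            print(piece, end="", flush=True)  # show tokens as they arrive
            pieces.append(piece)
    print()
    return "".join(pieces)
```

The accumulated text can then be appended to the conversation history exactly as with a non-streamed response.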
Utilize type hints not only for clarity but also because they can be checked with static analyzers or at runtime (with libraries like pydantic or attrs if needed) to catch issues early. That indirectly aids reliability – and performance, in the sense of less downtime spent debugging.

## Implementation Patterns and Best Practices

To integrate MCP into your existing audio-GPT project successfully, consider several implementation patterns and best practices, from coding style to testing. The aim is to maintain code quality and reliability while adding MCP capabilities.

### MCP Integration Patterns in Python

Leverage the **official Python SDK (`mcp`)** and especially the **FastMCP** helper to integrate quickly. The pattern is: **define your tools and resources as functions** with clear type hints, register them with a FastMCP server, and run that server either embedded or as a separate process. For example, in your project's structure you might create a module `mcp_server.py` where you set up `FastMCP("AudioGPT")` and add tools like `transcribe_audio` and perhaps a few others (a tool to fetch conversation history, or domain-specific tasks if needed). Once that is done, you can launch this MCP server from your CLI (perhaps behind a `--mcp` flag). While the server is running, any external MCP-compatible client (like Claude Desktop or another script) could connect to it.

However, since you want to use GPT-4 as the model, you may instead run your own **MCP client** loop internally. That is, your main program could instantiate the server (as above) and also use `ClientSession` to converse with it, effectively acting as the host. Embedding both client and server in one app is a bit unusual, but it's possible with asyncio. Alternatively, run the server as a subprocess and connect via `stdio_client` as in the earlier example ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=async%20def%20run,initialize)).

The pattern here would be similar to the `mcp-chatbot` repository you found – it demonstrates using GPT-4o (via the OpenAI API) as the model and MCP for tools. Its architecture is instructive: it dynamically includes tool info in the system prompt for the model ([GitHub - 3choff/mcp-chatbot: A simple CLI chatbot that demonstrates the integration of the Model Context Protocol (MCP).](https://github.com/3choff/mcp-chatbot#:~:text=,in%20the%20system%20prompt%2C%20allowing)). You can do the same: when calling GPT-4, include a description of the available tools in its prompt, and parse the model's output yourself to detect tool usage. OpenAI's function-calling feature could serve as an alternative approach, but since we focus on MCP, you'd effectively be parsing model outputs that say "I want to use X tool", or have the model emit a special format that triggers a tool call. This area can get complex – some MCP hosts handle it automatically, but if you roll your own with GPT-4 you'll need to implement a **decision loop**: the model sees the user input -> decides (possibly via prompt engineering) whether a tool is needed -> if yes, the tool is called -> the result is inserted -> the model continues (a sketch of this loop appears just below). One *simpler* pattern: always transcribe audio outside the model (since that's clearly needed) and feed the text to GPT, and only add more complex tool usage if other features require it.
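Here is a hedged sketch of that decision loop. It assumes the MCP `ClientSession` (with `list_tools()` returning an object whose `.tools` entries expose `name` and `description`) and the legacy `openai.ChatCompletion` interface used elsewhere in this document. The `TOOL:<name> <json>` convention, the function names, and the single-round flow are inventions of this example, not part of MCP or OpenAI.

```python
import json

import openai


async def build_system_prompt(session) -> str:
    """List the MCP server's tools and describe them to the model."""
    listing = await session.list_tools()
    lines = ["You may request a tool by replying with exactly: TOOL:<name> <json-arguments>"]
    for tool in listing.tools:
        lines.append(f"- {tool.name}: {tool.description}")
    return "\n".join(lines)


async def chat_turn(session, history: list[dict]) -> str:
    """One turn: ask GPT-4, run at most one requested tool call, then finish the answer."""
    resp = openai.ChatCompletion.create(model="gpt-4", messages=history)
    answer = resp["choices"][0]["message"]["content"]
    if answer.startswith("TOOL:"):
        name, _, raw_args = answer[len("TOOL:"):].partition(" ")
        result = await session.call_tool(name.strip(), arguments=json.loads(raw_args or "{}"))
        # Feed the tool result back so the model can produce its final reply.
        history.append({"role": "assistant", "content": answer})
        history.append({"role": "user", "content": f"Tool result: {result}"})
        resp = openai.ChatCompletion.create(model="gpt-4", messages=history)
        answer = resp["choices"][0]["message"]["content"]
    return answer
```

The prompt built by `build_system_prompt` would sit as the system message at the start of `history`; in practice you would also guard the `json.loads` call and cap the number of tool rounds, but this shows the shape of the loop.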
Maintain **clear separation of modules**: e.g., `audio_utils.py` for audio processing (conversion, plus a helper that calls the transcription tool or API), `chatbot.py` for conversation management (history and GPT calls), and `mcp_server.py` for the MCP server definition. This modularity makes testing easier and keeps the MCP integration straightforward, since MCP server functions can often just call into your existing utilities. For instance, the `transcribe_audio` tool might simply call a function in `audio_utils.py` that you already have for converting and transcribing.

### Type Hints and Runtime Checks

Your project is on Python 3.11, so you can fully utilize type annotations. As shown earlier, MCP tools defined with type hints automatically ensure type correctness for inputs and outputs ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=,return%20a%20%2B%20b)). This is powerful – it's like having an automatic schema for your tool's JSON interface. Use it to your advantage by specifying precise types (e.g., `file_path: str`; if you accept either raw bytes or a path string you might choose `Union[str, bytes]`, but that complicates things). The SDK serializes basic types (int, float, str, bool, dict, list) easily. If you need to pass more complex data (like a list of numbers or a JSON object), you can use those types in function signatures and the SDK will handle the JSON encoding and decoding.

Even outside the MCP aspect, use type hints in your code. This helps with static analysis (mypy) and documentation. When interfacing with external libraries like pydub or openai, be mindful of their expected types (e.g., `openai.Audio.transcribe` expects a binary file-like object for the file).

For runtime checks, incorporate assertions or explicit error handling. For example, if your `transcribe_audio` tool receives a non-existent file path, check for it up front and raise a `ValueError` with a clear message, which MCP will return as an error. This prevents confusing the model with a stack trace. MCP will typically catch exceptions and translate them, but giving a user-friendly (or model-friendly) error message is best. For instance:

```python
@mcp.tool()
def transcribe_audio(file_path: str) -> str:
    if not os.path.exists(file_path):
        raise ValueError(f"File not found: {file_path}")
    # ... continue with transcription
```

This way, the error is clean and can be shown to the user or logged.

### Error Handling Strategies

We've touched on error handling along the way; here is a systematic summary:

- **Within Tools:** Handle known failure modes and raise meaningful exceptions or return error-indicating values. For example, catch exceptions from `openai.Audio.transcribe` (network errors, API errors); the OpenAI library raises subclasses of `openai.error.OpenAIError`. You could catch that and raise a custom error or return an "error: ..." string. Decide whether you want the AI to see the error or just the user. Sometimes letting the AI see it is useful (it can apologize or rephrase), but often it's better for the system to handle it. Since this is a CLI tool, you might simply print a message to the console and not involve GPT at all.
- **At the Integration Level:** If the GPT API call fails (timeout, rate limit, etc.), catch the exception and retry or prompt the user. Rate limiting can become an issue if you transcribe many files quickly or make rapid GPT calls; OpenAI may return a 429 error.
Implement a backoff, or at least an informative message ("The service is busy, retrying in a few seconds…"). This improves resilience.
- **MCP Connection Issues:** If you run the MCP server as a separate process, handle the case where it isn't available or crashes. `ClientSession.initialize()` may fail if the server died; detect that and attempt to restart the server or alert the user. In a tightly integrated CLI you might choose to run everything in one process to avoid this, but if the server is separate, include a watchdog.
- **Graceful Shutdown:** Ensure that when your program exits, you shut down the MCP server process if it was started and close any audio devices that were open. This prevents orphan processes (e.g., ffmpeg left running, or the subprocess never terminated). Using the context managers as shown takes care of much of this automatically ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=async%20def%20run,initialize)), but be mindful if any part of your code exits the `with` block prematurely due to an exception – the context manager's `__exit__` still runs, but only if the block was actually entered. Structured exception handling around the main logic can catch anything unhandled and perform cleanup.
- **Logging and Monitoring:** Log errors to a file. A debug log that records each step (especially tool calls and model calls) helps you understand failures after the fact. Since this is a CLI app, the standard `logging` library can write to the console or a file with appropriate verbosity levels.

### Testing Best Practices

Given that the project already has `pytest` as a dependency, extend your test suite to cover the new MCP-integrated functionality. Here are some strategies:

- **Unit Test Tools:** Test the MCP tool functions in isolation. For example, write a test for `transcribe_audio` that doesn't actually call the OpenAI API but mocks it instead. Using pytest's `monkeypatch` fixture or the `unittest.mock` library, patch `openai.Audio.transcribe` to return a preset value and simulate both a successful transcription and an error scenario. Also test edge cases (nonexistent file, empty file, very short audio). Similarly, if you add other tools, test them with representative inputs. These tests ensure your logic (pydub processing, error handling) works as expected.
- **Integration Test Flow:** Write an integration test that simulates a full conversation turn. For example, provide a small audio file (a tiny WAV saying "hello" in your test assets) and run through the pipeline: transcribe it (perhaps monkeypatching Whisper to just return "hello"), then feed the text to a dummy GPT model. For tests, you likely **don't want to call the real GPT-4 API** (it is costly and requires network access). Instead, monkeypatch `openai.ChatCompletion.create` to return a fixed response – for instance, always "Hi there!" for any prompt. This lets you test the control flow (audio in -> message out, conversation history updated, etc.) without external dependencies.
- **CLI Tests:** If your CLI is triggered via an entry point, you can use `pytest` to run it as a subprocess (using Python's `subprocess` module), or call the entry function directly and capture its output with pytest's `capsys` fixture.
For example, if `main.py` has a `main()` that reads arguments and orchestrates everything, you can call `main()` within a test with specific parameters (like `--input tests/data/hello.wav`) and verify that the output printed to stdout matches the expected transcript or response. You may need to inject dummy API keys or disable actual API calls via mocks in this context as well.
- **Performance Tests (if feasible):** You could check that transcribing a known file does not exceed a time limit, though this is tricky due to variability (and unreliable if it calls an external API). Instead, you might test that certain operations don't blow up memory – say, run a loop of 10 transcriptions on a short file and confirm memory doesn't accumulate (with profiling tools, or just by ensuring the loop completes without errors). These aren't standard unit tests, but if performance is a core concern, some such checks are worth integrating.
- **Mock External Dependencies:** As mentioned, use monkeypatch to intercept calls to `openai.Audio.transcribe` and `openai.ChatCompletion.create`. You might not need to mock pydub if you use real tiny audio files – pydub handles a clip of a few milliseconds quickly. Alternatively, you can generate an `AudioSegment` in memory (pydub can synthesize sine waves, etc.), but that adds complexity; a real one-second WAV of silence suffices to exercise the pipeline.
- **Environment and Config:** Use `.env` or environment variables to load API keys in your app. In tests, either set dummy keys (since you won't actually call the API when mocking) or ensure the code doesn't require a real key in test mode. One option is to design your OpenAI call wrapper to call the API only when a global `USE_OPENAI` flag is True and use a stub otherwise, but simply monkeypatching, as noted above, is easier. Pytest's `monkeypatch` fixture can set `openai.api_key` to a fake value to avoid warnings.
- **Testing Conversation Logic:** Simulate a multi-turn conversation in a test by calling your conversation-handling function several times with different inputs (some text, some audio). Ensure the history accumulates and that older entries are trimmed when they should be. You can enforce a small maximum history in tests (perhaps by injecting a parameter that limits tokens) to exercise the truncation logic.

By following these testing practices, you'll catch regressions and ensure that adding MCP didn't break existing functionality (for instance, the program should still work when used purely in text mode). Since the project was structured with tests and documentation, update the docs to explain the new features (e.g., "now supports MCP", or how to use voice input). Document any new environment variables (an Anthropic Claude key if you were to use one, or config toggles for choosing GPT-4o vs. Whisper).

## Integrating MCP into the Existing Audio-GPT Codebase

Finally, let's outline how to integrate all of this into your project, step by step, with code snippets illustrating the combined system. The existing project is a CLI for processing audio with Whisper and managing a GPT-4 conversation; we will introduce MCP while preserving the current capabilities.

### Step 1: Install and Import MCP

First, install the MCP Python SDK: `pip install mcp`. This gives you access to the `mcp` package and the FastMCP utilities ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=uv%20add%20)) ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=pip%20install%20mcp)).
In your code, import what's needed:

```python
# In server.py or similar
from mcp.server.fastmcp import FastMCP
```

and, for a client if needed:

```python
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
```

Ensure `requirements.txt` (or `pyproject.toml`) is updated to include `mcp`.

### Step 2: Define MCP Server with Tools

As discussed, create a new MCP server instance and add tools. For example, in `mcp_server.py`:

```python
import os

import openai
from pydub import AudioSegment
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("AudioGPT")

@mcp.tool()
def transcribe_audio(file_path: str) -> str:
    """Transcribe an audio file to text using Whisper."""
    if not os.path.exists(file_path):
        raise ValueError(f"File not found: {file_path}")
    # Optionally, enforce a size limit to avoid huge files
    file_size = os.path.getsize(file_path)
    if file_size > 25 * 1024 * 1024:  # 25 MB
        raise ValueError("Audio file too large to transcribe")
    # Convert to the required format if needed
    audio = AudioSegment.from_file(file_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    temp_wav = "/tmp/conv_audio.wav"
    audio.export(temp_wav, format="wav")
    # Call the OpenAI Whisper API
    with open(temp_wav, "rb") as f:
        response = openai.Audio.transcribe("whisper-1", f)
    text = response.get("text", "")
    return text

# Potentially add more tools, e.g., a tool to get a conversation summary or to do TTS.
```

*(In practice, consider generating a unique temp filename per call to avoid collisions under concurrent calls, or use an in-memory bytes buffer via `BytesIO` with the AudioSegment export.)*

You might also add a **resource** if you want to expose raw files. For example, if you want the AI to be able to read text from certain files, you could declare `@mcp.resource("file://{path}")` and implement it. For now, audio is handled via the tool.

After defining tools, you need to run the server. For CLI usage you have options: run it in-process (which blocks if you call `mcp.run()`, since it enters a serving loop), or run it as a background process. One pattern is to use Python's multiprocessing or threading to start the server – e.g., in your main CLI, if `args.mcp` is true, start a thread that calls `mcp.run()`. FastMCP's `run()` serves over stdio by default, which is also how Claude Desktop launches servers, so running it standalone looks like:

```python
if __name__ == "__main__":
    mcp.run()  # Serves the tools over the stdio transport by default.
```

The SDK also provides an `mcp dev server.py` command for testing, which runs the server for you ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=You%20can%20install%20this%20server,it%20right%20away%20by%20running)). For integration, you might not even need to start it manually if you use the client with `stdio_client` – it will spawn the process given the command. For example:

```python
# In main logic
server_params = StdioServerParameters(command="python", args=["server.py"])
async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        # ... use session to call tools
```

This spawns the server script in a subprocess (adjust the filename to match your module, e.g. `mcp_server.py`) and connects to it. If you prefer not to have an extra process, you could run the server loop in the same process using threads.
However, using `stdio_client` as intended is arguably cleaner and isolates the server (if a tool crashes, it won't take down your main program).

### Step 3: Incorporate MCP Tool Calls in Conversation Flow

Now modify your conversation loop to use the MCP tool instead of calling Whisper directly. If the user input is audio (detected by file extension or a command flag), connect to the MCP server (or, if it's persistent, ensure it's running) and call the `transcribe_audio` tool via the session. For example, continuing from the client snippet above:

```python
user_input = "path/to/input.m4a"  # assume this comes from CLI args
if user_input.endswith((".mp3", ".wav", ".m4a", ".ogg")):
    # It's an audio file: use the MCP tool to transcribe it
    result = await session.call_tool("transcribe_audio", arguments={"file_path": user_input})
    transcript = result.content[0].text  # the tool's output arrives as text content
    conversation_history.append({"role": "user", "content": transcript})
else:
    # Text input
    conversation_history.append({"role": "user", "content": user_input})

# Now get the GPT response
openai_resp = openai.ChatCompletion.create(model="gpt-4", messages=conversation_history)
assistant_text = openai_resp["choices"][0]["message"]["content"]
conversation_history.append({"role": "assistant", "content": assistant_text})
print(f"Assistant: {assistant_text}")
```

This pseudo-code shows the integration: if the input is audio, call the MCP tool (which uses Whisper behind the scenes); if it's text, use it directly. Then call GPT-4 as usual with the updated history. You'd wrap this in an async function, since the session is async; in a CLI, run it with `asyncio.run(main())`. If your CLI is otherwise synchronous, you can run the event loop just for this part (e.g., via `asyncio.get_event_loop().run_until_complete`) or use a synchronous interface if the SDK offers one.

A key detail is how `session.call_tool` returns data: it returns a result object whose content carries the tool's output – in our case a single text item – rather than a bare string ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=,name%22%2C%20arguments%3D%7B%22arg1%22%3A%20%22value)), so we pull the transcript out of that content. If the tool raised an error, the SDK surfaces it as an exception or an error result, so wrap the call in try/except to handle the `ValueError` from our tool (file not found, file too large) and print the error or inform the user.

### Step 4: Conversation Management Enhancements

Your existing code already manages conversation history. Make sure that when multiple audio inputs arrive sequentially, it keeps appending properly. Also consider a **conversation reset / new session** feature (a command-line flag or a special user input like "reset") that clears the history list and starts fresh. This is useful after a long conversation, to avoid an endlessly growing context or mixing separate topics. Additionally, if you plan to support GPT-4o with direct audio input, you could add a mode for that.
For example, `--model gpt-4o` could skip Whisper:

```python
if args.model == "gpt-4o":
    # Use a direct API call with audio
    with open(user_input, "rb") as f:
        openai_resp = openai.ChatCompletion.create(
            model="gpt-4o-audio-preview",  # hypothetical model name
            messages=[{"role": "user", "content": None}],  # content None or empty
            file=f  # perhaps the file is passed separately
        )
    transcript = openai_resp["something"]["transcription"]
    assistant_text = openai_resp["choices"][0]["message"]["content"]
    # Now handle it as one step: the user said X (transcript), the assistant answered.
    conversation_history.append({"role": "user", "content": transcript})
    conversation_history.append({"role": "assistant", "content": assistant_text})
```

This pseudocode is speculative, because OpenAI's exact API for audio input to GPT-4o may differ. The idea is that GPT-4o can do transcription + answer in one call. If you can get the transcript out (some APIs might not expose it explicitly, in which case GPT-4o might respond with something like "You asked about Y...", implying what it heard), capture it; otherwise the conversation history would contain only the assistant's answer, with the user's turn left as an empty placeholder, which is not ideal. Until the GPT-4o audio API is clearly documented, sticking to explicit Whisper transcription is more deterministic.

### Step 5: Testing and Validation

Run your test suite. Previous tests for Whisper transcription may need updating to reflect that transcription now goes through the MCP tool. If those tests mocked Whisper, you can mock the `transcribe_audio` tool instead. You can also still test the Whisper integration directly by calling the `transcribe_audio` function (it's just a normal function when invoked directly). Ensure new tests cover the client-server interaction where possible: for example, spin up the MCP server in a thread inside a test, then call the client function to transcribe a small audio file. This is integration-heavy, so mark it as such or use a fixture to manage the server lifecycle during tests.

Check performance with some sample inputs. If you have large audio files, verify the behavior (the tool should refuse anything over 25 MB as coded, which avoids a long hang). Test various audio formats (mp3, wav, etc.) to confirm pydub handles them, given ffmpeg availability.

Document in your README how to use the new features. For instance: "You can now provide an audio file as input; the program will transcribe it using Whisper (via MCP) and include it in the GPT-4 conversation context automatically." Mention any new dependencies (ffmpeg for pydub – though users likely had that for Whisper anyway – and the MCP library).

### Deliverables Recap

By following this integration plan, you achieve:

- **MCP-based audio context management:** The `transcribe_audio` tool encapsulates the audio transcription logic, providing a clear interface between the LLM and audio data. Your audio processing can now be accessed via the MCP protocol by any compatible AI client, not just your internal code. (E.g., one could plug your server into Claude Desktop and it would list a "transcribe_audio" tool automatically, thanks to MCP's introspection.)
- **Conversation state handling:** We maintain a conversation history list of messages, appending user transcripts and model replies each turn ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=2,API%20request%20to%20provide%20context)) ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history)). We implement best practices like limiting context size and summarizing or resetting as needed ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=,stored%20in%20the%20conversation%20history)). This ensures GPT-4 has the necessary context for multi-turn conversations and that context is preserved across modality boundaries.
- **Integration with Whisper and GPT-4o:** We provided code examples using OpenAI's Whisper API via an MCP tool ([Making transcriptions using OpenAI's Whisper - Tilburg Science Hub](https://tilburgsciencehub.com/topics/automation/ai/transcription/whisper/#:~:text=)). We also discussed how GPT-4o might be used directly for audio input ([Hello GPT-4o - OpenAI](https://openai.com/index/hello-gpt-4o/#:~:text=Hello%20GPT,and%20text%20in%20real%20time)). The system is designed so that switching the model or transcription method is modular – e.g., you could replace the internals of `transcribe_audio` with a call to a local library or a different API without affecting the rest of the conversation logic. We also have the flexibility to handle audio either in a separate step (Whisper) or end-to-end (GPT-4o) depending on availability, with the conversation state updated appropriately in both cases.

In terms of **performance**, we expect the overhead of MCP to be minimal compared to the heavy lifting of ML inference. By using asynchronous calls and possibly streaming, the user experience can remain smooth. If needed, you can benchmark a round-trip of calling `transcribe_audio` on a small file to verify it adds only a few milliseconds beyond the transcription time itself. If you find any bottlenecks, they will likely lie in the transcription or model API, where the solutions are to batch requests or use more efficient models (e.g., `whisper-1` is already quite optimized; a smaller model could be faster but less accurate).

**Optimization Strategies Recap:** Use caching for repeat audio ([path - How to successfully transcribe audio files using Whisper for OpenAI in Python? - Stack Overflow](https://stackoverflow.com/questions/76366387/how-to-successfully-transcribe-audio-files-using-whisper-for-openai-in-python#:~:text=They%20also%20recommend%20using%20AudioSegment,the%20audio%20file%20in%20pieces)), limit conversation length ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=,Retaining)), and possibly use GPT-4o's multi-modal capability to reduce steps when available. Clean up resources to avoid memory leaks (temp files, lingering processes). With these in place, your integrated system should handle audio-to-text and text-to-text seamlessly, leveraging MCP for a clean separation of concerns.
The result is a more extensible codebase: in the future you could add an MCP tool for, say, fetching definitions from a dictionary or searching the web, and your GPT-4 could call it – effectively growing your system into an **MCP-powered multi-tool assistant**, all within the robust framework you've set up.

### References

- Anthropic, *Introducing the Model Context Protocol (MCP)* – open standard for connecting AI to data sources ([Introducing the Model Context Protocol \ Anthropic](https://www.anthropic.com/news/model-context-protocol#:~:text=The%20Model%20Context%20Protocol%20is,that%20connect%20to%20these%20servers)) ([Introducing the Model Context Protocol \ Anthropic](https://www.anthropic.com/news/model-context-protocol#:~:text=MCP%20addresses%20this%20challenge,to%20the%20data%20they%20need))
- Raygun Blog, *Engineering AI systems with MCP* – outlines MCP hosts vs. servers, tools/resources/prompts ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=MCP%20consists%20of%20two%20components%3A,but%20still%20supports%20remote%20APIs)) ([Engineering AI systems with Model Context Protocol · Raygun Blog](https://raygun.com/blog/announcing-mcp/#:~:text=,to%20create%20standardized%20commit%20messages))
- MCP Python SDK documentation – code examples for client sessions and tool/resource definitions ([GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients](https://github.com/modelcontextprotocol/python-sdk#:~:text=The%20MCP%20protocol%20defines%20three,primitives%20that%20servers%20can%20implement)) ([mcp · PyPI](https://pypi.org/project/mcp/#:~:text=,return%20a%20%2B%20b))
- Simon Willison, *MCP via Claude Desktop* – notes JSON-RPC over stdio for MCP servers ([Introducing the Model Context Protocol](https://simonwillison.net/2024/Nov/25/model-context-protocol/#:~:text=Their%20first%20working%20version%20of,standard%20input%20and%20standard%20output))
- OpenAI Whisper API usage – sample code for transcribing audio files ([Making transcriptions using OpenAI's Whisper - Tilburg Science Hub](https://tilburgsciencehub.com/topics/automation/ai/transcription/whisper/#:~:text=))
- OpenAI developer guidance – managing conversation history with chat models ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=2,API%20request%20to%20provide%20context)) ([Openai-Python Remember Conversation | Restackio](https://www.restack.io/p/openai-python-answer-remember-conversation-cat-ai#:~:text=4,reply%20to%20the%20conversation%20history))
- Stack Overflow – best practice of chunking audio with pydub for large files ([path - How to successfully transcribe audio files using Whisper for OpenAI in Python? - Stack Overflow](https://stackoverflow.com/questions/76366387/how-to-successfully-transcribe-audio-files-using-whisper-for-openai-in-python#:~:text=They%20also%20recommend%20using%20AudioSegment,the%20audio%20file%20in%20pieces))
- OpenAI announcement of GPT-4o – multi-modal (text, image, **audio**) capabilities ([Hello GPT-4o - OpenAI](https://openai.com/index/hello-gpt-4o/#:~:text=Hello%20GPT,and%20text%20in%20real%20time))
