Local AI MCP
Allows managing local Ollama models: discover installed and loaded models, pull/remove models, load/unload from memory, check hardware fit, and offload inference (completion and embedding) to local Ollama instances.
Unified MCP server for managing local model runtimes (Ollama, LM Studio, and more): provider-agnostic discovery, lifecycle, hardware-fit, and delegated inference.
Local AI MCP is an MCP server that turns your local model runtimes into an agent-callable control plane. It is operations-first: its primary job is to discover, inspect, fit, and manage the models running on your own machine. It speaks to runtimes over their local HTTP APIs and exposes one consistent tool surface across them, so an agent does not need to know whether a model lives in Ollama or LM Studio.
The server communicates over stdio only. It is a client to your local runtimes and never opens a network listener of its own.
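Because the transport is stdio, a client drives the server by writing newline-delimited JSON-RPC 2.0 messages to the server process's stdin. The following is a minimal sketch of one such frame, assuming the MCP initialize handshake has already completed. The request id is arbitrary, `toolsCallFrame` is a hypothetical helper (not part of this package), and the message shape follows the general MCP specification rather than anything specific to this server.

```typescript
// JSON-RPC 2.0 request shape as used by MCP clients.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Build one stdio frame for an MCP `tools/call` request. JSON.stringify without an
// indent argument never emits a raw newline, so the trailing "\n" is the only delimiter.
function toolsCallFrame(id: number, name: string, args: Record<string, unknown> = {}): string {
  const request: JsonRpcRequest = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
  return JSON.stringify(request) + "\n";
}

// Illustrative call: request a hardware snapshot (system_resources takes no provider argument).
console.log(toolsCallFrame(2, "system_resources").trimEnd());
```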
Why an ops-first local-model server
- Discovery and lifecycle, not just chat. List what is installed and what is loaded, pull and remove models, load and unload them, and check how they fit your hardware before you commit VRAM to them.
- Hardware-aware. system_resources and fit_check read your real RAM and GPU/VRAM so an agent can pick a model that will actually run, and suggest_model ranks candidates by task and by what fits.
- Provider-agnostic. Every tool except system_resources takes an optional provider argument. Omit it and the tool operates across all detected runtimes, aggregating results per provider.
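The provider-agnostic behavior amounts to fanning a call out to every detected runtime when `provider` is omitted, then grouping the results by provider name. The sketch below is minimal and assumes a hypothetical `ModelLister` shape and a `listAcross` helper. It uses `Promise.allSettled` so that one unreachable runtime reports an error without hiding the others. It illustrates the idea and is not the server's actual code.

```typescript
type ProviderName = "ollama" | "lmstudio";

// Hypothetical minimal adapter shape: just enough to list installed models.
interface ModelLister {
  name: ProviderName;
  listModels(): Promise<string[]>;
}

// Per-provider result: either the models or the error that provider raised.
type PerProvider = Partial<Record<ProviderName, { models?: string[]; error?: string }>>;

// With `provider` set, query only that runtime; otherwise fan out to all detected ones.
// Promise.allSettled keeps one unreachable runtime from failing the whole call.
async function listAcross(detected: ModelLister[], provider?: ProviderName): Promise<PerProvider> {
  const targets = provider ? detected.filter((p) => p.name === provider) : detected;
  const settled = await Promise.allSettled(targets.map((p) => p.listModels()));
  const out: PerProvider = {};
  settled.forEach((result, i) => {
    out[targets[i].name] =
      result.status === "fulfilled" ? { models: result.value } : { error: String(result.reason) };
  });
  return out;
}

// Illustrative stand-ins for two runtimes, one healthy and one down.
const demoProviders: ModelLister[] = [
  { name: "ollama", listModels: async () => ["example-model"] },
  { name: "lmstudio", listModels: async () => { throw new Error("connection refused"); } },
];
listAcross(demoProviders).then((r) => console.log(r));
```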
Inference is delegation, not chat
The complete and embed tools exist to delegate (offload) inference to a local model for cost control and privacy: keep tokens and data on your own hardware instead of sending them to a hosted API. They are deliberately framed as delegated/offloaded inference primitives, not as a conversational chat surface.
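Delegation to Ollama can be sketched against Ollama's documented REST endpoints. `POST /api/generate` with `stream: false` returns the whole completion in a `response` field, and `POST /api/embed` returns one vector per input in `embeddings`. This is a minimal sketch, not this server's adapter. The `PostJson` type and both helper names are hypothetical. Injecting the transport lets the same code run against a stub offline.

```typescript
// Hypothetical minimal POST-JSON transport; fetchPost below is one real implementation.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

// Delegate one non-streaming completion to Ollama's /api/generate.
async function delegateComplete(post: PostJson, host: string, model: string, prompt: string): Promise<string> {
  const res = (await post(`${host}/api/generate`, { model, prompt, stream: false })) as { response?: unknown };
  if (typeof res.response !== "string") throw new Error("unexpected /api/generate payload");
  return res.response;
}

// Delegate embedding generation to Ollama's /api/embed (one vector per input string).
async function delegateEmbed(post: PostJson, host: string, model: string, input: string[]): Promise<number[][]> {
  const res = (await post(`${host}/api/embed`, { model, input })) as { embeddings?: unknown };
  if (!Array.isArray(res.embeddings)) throw new Error("unexpected /api/embed payload");
  return res.embeddings as number[][];
}

// Transport built on the global fetch (Node 18+); swap in a stub for offline tests.
const fetchPost: PostJson = async (url, body) => {
  const r = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!r.ok) throw new Error(`HTTP ${r.status} from ${url}`);
  return r.json();
};
```

In real use, a call would look like `delegateComplete(fetchPost, "http://localhost:11434", "<model>", prompt)`, where `<model>` is a placeholder for an installed model name.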
The provider-adapter model
Each runtime is implemented as an adapter behind a single Provider interface (src/providers/types.ts) with a uniform method set: detect, health, listModels, listLoaded, modelInfo, pull, remove, load, unload, complete, embed, and capabilities. Adding a runtime means adding one adapter; the tool layer is unchanged.
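The README lists only the method names of the Provider interface, so every signature below is an assumption. This minimal sketch shows what such an interface could look like. It adds a placeholder adapter for a runtime that is never detected, to show that the tool layer depends only on the interface. `ModelRef`, `Capabilities`, `UnavailableProvider`, and the registry are hypothetical names.

```typescript
// Hypothetical signatures: the README lists only the method names.
interface ModelRef {
  name: string;
  sizeBytes?: number;
}

interface Capabilities {
  pull: boolean;
  load: boolean;
  complete: boolean;
  embed: boolean;
}

interface Provider {
  readonly id: string;
  detect(): Promise<boolean>;
  health(): Promise<{ ok: boolean; version?: string }>;
  listModels(): Promise<ModelRef[]>;
  listLoaded(): Promise<ModelRef[]>;
  modelInfo(model: string): Promise<Record<string, unknown>>;
  pull(model: string): Promise<void>;
  remove(model: string): Promise<void>;
  load(model: string): Promise<void>;
  unload(model: string): Promise<void>;
  complete(model: string, prompt: string): Promise<string>;
  embed(model: string, input: string[]): Promise<number[][]>;
  capabilities(): Capabilities;
}

// Placeholder adapter for a runtime that is never detected: every operation rejects.
class UnavailableProvider implements Provider {
  readonly id: string;
  constructor(id: string) {
    this.id = id;
  }
  private fail(): Promise<never> {
    return Promise.reject(new Error(`${this.id} is not available`));
  }
  async detect(): Promise<boolean> { return false; }
  async health(): Promise<{ ok: boolean }> { return { ok: false }; }
  listModels() { return this.fail(); }
  listLoaded() { return this.fail(); }
  modelInfo(_model: string) { return this.fail(); }
  pull(_model: string) { return this.fail(); }
  remove(_model: string) { return this.fail(); }
  load(_model: string) { return this.fail(); }
  unload(_model: string) { return this.fail(); }
  complete(_model: string, _prompt: string) { return this.fail(); }
  embed(_model: string, _input: string[]) { return this.fail(); }
  capabilities(): Capabilities {
    return { pull: false, load: false, complete: false, embed: false };
  }
}

// The tool layer looks adapters up by id and never depends on a concrete class.
const registry = new Map<string, Provider>();
function register(provider: Provider): void {
  registry.set(provider.id, provider);
}
register(new UnavailableProvider("example-runtime"));
```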
| Adapter | Default host | Transport | Notes |
|---|---|---|---|
| Ollama ( | | Native REST + OpenAI-compatible | |
| LM Studio ( | | REST ( | Uses the |
Auto-detection: on each call the server probes the configured local endpoints to determine which runtimes are live. Hardware probing is isolated in src/hardware/ and branches by platform (Windows / Linux); it exposes total/free RAM and, where detectable, GPU name and VRAM.
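Auto-detection can be approximated by a short, time-boxed GET against each configured host. This is a minimal sketch that assumes two liveness endpoints: `GET /api/version` for Ollama and `GET /v1/models` for LM Studio's OpenAI-compatible server. The README does not say which endpoints this server actually probes. The `Probe` type and helper names are hypothetical, and the global `fetch` satisfies `Probe`.

```typescript
// Hypothetical probe transport; the global fetch satisfies this signature.
type Probe = (url: string, init: { signal: AbortSignal }) => Promise<{ ok: boolean }>;

// Assumed liveness paths per runtime (not confirmed by the README).
const PROBE_PATHS = { ollama: "/api/version", lmstudio: "/v1/models" } as const;
type Runtime = keyof typeof PROBE_PATHS;

// True if the endpoint answers 2xx within timeoutMs; any error or timeout means "not live".
async function isLive(probe: Probe, url: string, timeoutMs: number): Promise<boolean> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await probe(url, { signal: ctrl.signal });
    return res.ok;
  } catch {
    return false;
  } finally {
    clearTimeout(timer);
  }
}

// Probe all configured runtimes in parallel and report which are live.
async function detectRuntimes(
  probe: Probe,
  hosts: Record<Runtime, string>,
  timeoutMs = 1500,
): Promise<Record<Runtime, boolean>> {
  const names = Object.keys(hosts) as Runtime[];
  const live = await Promise.all(names.map((n) => isLive(probe, hosts[n] + PROBE_PATHS[n], timeoutMs)));
  return Object.fromEntries(names.map((n, i) => [n, live[i]])) as Record<Runtime, boolean>;
}
```

The timeout is enforced with `AbortController`, so a probe that honors the signal is actually cancelled rather than left running in the background.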
Tool surface (16 tools)
Discovery
| Tool | Description |
|---|---|
| | Configured runtimes, their host, live/detected status, and capabilities. |
| | Installed models across detected providers (or one provider). |
| | Models currently resident in memory. |
| | Detailed metadata for a model. |
Lifecycle
| Tool | Description |
|---|---|
| | Download a model. Heavy: may transfer multiple GB. |
| | Delete a model from disk. Destructive: requires |
| | Load a model into memory (Ollama |
| | Evict a model from memory. |
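Pulls are heavy, so progress reporting matters. Ollama's `POST /api/pull` streams newline-delimited JSON objects. Each object has a `status` field and, while layers download, `digest`, `total`, and `completed` byte counts. This is a minimal sketch of buffering that stream into events and computing a percentage, assuming chunks arrive as arbitrary string slices. The helper names are hypothetical, and this is not the server's own parser.

```typescript
// One progress object from Ollama's streamed /api/pull response.
interface PullProgress {
  status: string;
  digest?: string;
  total?: number;
  completed?: number;
}

// Split buffered NDJSON into complete events; the trailing partial line is returned
// as `rest` so the caller can prepend it to the next chunk.
function parsePullChunk(buffer: string): { events: PullProgress[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? "";
  const events = lines.filter((l) => l.trim() !== "").map((l) => JSON.parse(l) as PullProgress);
  return { events, rest };
}

// Whole-number percent for a download event, or undefined for status-only events.
function percent(ev: PullProgress): number | undefined {
  return ev.total && ev.completed !== undefined ? Math.floor((ev.completed / ev.total) * 100) : undefined;
}

// Illustrative stream split mid-object across two chunks.
let buffer = "";
for (const piece of ['{"status":"pulling manifest"}\n{"status":"pull', 'ing abc","digest":"abc","total":200,"completed":50}\n']) {
  const { events, rest } = parsePullChunk(buffer + piece);
  buffer = rest;
  for (const ev of events) console.log(ev.status, percent(ev));
}
```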
Ops
| Tool | Description |
|---|---|
| | Liveness and version per provider. |
| system_resources | Total/free RAM, CPU count, and GPU/VRAM. |
| fit_check | Whether a model fits in free VRAM (GPU) or RAM (CPU), with the numbers. |
| | Measure latency and tokens/sec with one small completion. Heavy: runs real inference. |
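Once the numbers are in hand, the Ops checks reduce to simple arithmetic. This minimal sketch makes three assumptions:

- RAM and CPU count come from Node's `os` module. GPU/VRAM detection is platform-specific and is left out.
- A hypothetical 1.2x overhead factor on the model's weight size covers KV cache and runtime buffers.
- The tokens-per-second figure uses Ollama's documented `eval_count` and `eval_duration` (nanoseconds) response fields.

None of the helper names come from this server.

```typescript
import * as os from "node:os";

interface Resources {
  totalRamBytes: number;
  freeRamBytes: number;
  cpuCount: number;
  freeVramBytes?: number; // set only when a GPU was detected
}

// RAM and CPU count from Node's os module; a real probe would add GPU name and VRAM per platform.
function ramAndCpu(): Resources {
  return { totalRamBytes: os.totalmem(), freeRamBytes: os.freemem(), cpuCount: os.cpus().length };
}

interface FitResult {
  fits: boolean;
  target: "gpu" | "cpu";
  requiredBytes: number;
  availableBytes: number;
}

// Compare the padded model size against free VRAM when a GPU is known, else against free RAM.
function fitCheck(modelBytes: number, res: Resources, overhead = 1.2): FitResult {
  const vram = res.freeVramBytes;
  const target: "gpu" | "cpu" = vram !== undefined ? "gpu" : "cpu";
  const availableBytes = vram !== undefined ? vram : res.freeRamBytes;
  const requiredBytes = Math.ceil(modelBytes * overhead);
  return { fits: requiredBytes <= availableBytes, target, requiredBytes, availableBytes };
}

// Generation throughput from Ollama's eval_count (tokens) and eval_duration (nanoseconds).
function tokensPerSecond(evalCount: number, evalDurationNs: number): number {
  return evalDurationNs > 0 ? (evalCount * 1e9) / evalDurationNs : 0;
}

const GiB = 1024 ** 3;
console.log(fitCheck(4 * GiB, ramAndCpu()));
```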
Registry
| Tool | Description |
|---|---|
| | Search a curated catalog of well-known models (Ollama library oriented). |
| suggest_model | Recommend a model for a task, ranked by what fits your detected hardware. |
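Ranking by task and fit can be sketched as filter, partition, and sort. This minimal sketch assumes a hypothetical catalog entry shape with task tags and an on-disk size. Among models that fit, it uses size as a rough proxy for capability. The README does not describe the actual ranking rules in suggest_model, and the catalog names below are placeholders, not real models.

```typescript
// Hypothetical catalog entry shape.
interface CatalogEntry {
  name: string;
  tasks: string[];
  sizeBytes: number;
}

interface Suggestion {
  name: string;
  fits: boolean;
}

// Keep entries tagged for the task; fitting models first, largest first (rough capability proxy);
// non-fitting models after, smallest first (closest to fitting).
function suggestModels(catalog: CatalogEntry[], task: string, budgetBytes: number): Suggestion[] {
  return catalog
    .filter((e) => e.tasks.includes(task))
    .map((e) => ({ name: e.name, size: e.sizeBytes, fits: e.sizeBytes <= budgetBytes }))
    .sort((a, b) => Number(b.fits) - Number(a.fits) || (a.fits ? b.size - a.size : a.size - b.size))
    .map(({ name, fits }) => ({ name, fits }));
}

// Placeholder catalog; sizes are illustrative only.
const GB = 1e9;
const demoCatalog: CatalogEntry[] = [
  { name: "coder-large", tasks: ["code"], sizeBytes: 20 * GB },
  { name: "coder-medium", tasks: ["code", "chat"], sizeBytes: 7 * GB },
  { name: "coder-small", tasks: ["code"], sizeBytes: 2 * GB },
  { name: "embedder", tasks: ["embedding"], sizeBytes: 1 * GB },
];
console.log(suggestModels(demoCatalog, "code", 8 * GB));
```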
Delegation (offloaded inference)
| Tool | Description |
|---|---|
| complete | Delegate a completion to a local model (cost/privacy offload, not chat). |
| embed | Delegate embedding generation to a local model. |
Every tool except system_resources accepts an optional provider (ollama | lmstudio). Omit it to operate across all detected runtimes.
Install and run
```bash
npx @tmhs/local-ai-mcp
```

Claude Desktop / Cursor config

```json
{
  "mcpServers": {
    "local-ai": {
      "command": "npx",
      "args": ["-y", "@tmhs/local-ai-mcp"],
      "env": {
        "OLLAMA_HOST": "http://localhost:11434",
        "LMSTUDIO_HOST": "http://localhost:1234"
      }
    }
  }
}
```

Configuration
All configuration is via environment variables with sane defaults:
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | | Ollama base URL (scheme optional; added if missing). |
| LMSTUDIO_HOST | | LM Studio base URL. |
| | | Timeout for normal requests (inference, pull progress, etc.). |
| | | Timeout for provider auto-detection probes. |
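Because the scheme is optional, reading the host variables implies a normalization step. This minimal sketch assumes that `http://` is the scheme that gets added and that trailing slashes are stripped. Both helper names are hypothetical. The fallback in the example mirrors the value in the Claude Desktop / Cursor config example, not a documented default.

```typescript
// Prefix "http://" when no scheme is present (assumed default) and drop trailing slashes.
function normalizeHost(raw: string): string {
  const trimmed = raw.trim().replace(/\/+$/, "");
  return /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed) ? trimmed : `http://${trimmed}`;
}

// Read a host from an env-style map (process.env in real use), falling back when unset or blank.
function hostFromEnv(env: Record<string, string | undefined>, key: string, fallback: string): string {
  const value = env[key];
  return normalizeHost(value !== undefined && value.trim() !== "" ? value : fallback);
}

// Illustrative: a scheme-less value as it might be set in the environment.
console.log(hostFromEnv({ OLLAMA_HOST: "10.0.0.5:11434" }, "OLLAMA_HOST", "http://localhost:11434"));
```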
Development
```bash
npm install
npm run build   # tsc -> dist/
npm test        # vitest; runs fully offline (mocked HTTP, stubbed hardware)
```

The test suite requires no running runtime and no downloaded model: every HTTP call is mocked and hardware probing is stubbed.
License
CC-BY-NC-ND-4.0 -- see LICENSE.
Built by TMHSDigital