MCP Server with Local LLM

A Model Context Protocol (MCP) server that integrates with your local Large Language Model, providing tools and resources that can be used by MCP-compatible clients like Claude Desktop, IDEs, and other AI assistants.

Features

  • Local LLM Integration: Uses your local language model (default: Qwen3-8B) for all operations

  • MCP-Compatible: Implements the Model Context Protocol specification

  • Multiple Tools: Chat, code analysis, text generation, translation, and summarization

  • Resource Access: Exposes model information and conversation history

  • Memory Efficient: Uses 4-bit quantization to fit large models in limited GPU memory

Available Tools

1. chat_with_llm

Basic chat interface with the local LLM.

Parameters:

  • message (required): The message to send to the LLM

  • system_prompt (optional): System prompt to guide behavior

  • temperature (optional): Temperature for generation (0-1, default: 0.7)

  • max_tokens (optional): Maximum tokens to generate (default: 512)
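
A tools/call request for this tool, as an MCP client would send it over stdio, might look like the following sketch (the JSON-RPC envelope follows the MCP specification; the message and parameter values are illustrative):

    import json

    # Illustrative JSON-RPC request for the chat_with_llm tool; argument names
    # match the parameter list above, the values are made up for this example.
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "chat_with_llm",
            "arguments": {
                "message": "Explain what a context manager is in Python.",
                "system_prompt": "You are a concise Python tutor.",
                "temperature": 0.3,
                "max_tokens": 256,
            },
        },
    }
    print(json.dumps(request))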

2. analyze_code

Analyze and review code using the local LLM.

Parameters:

  • code (required): The code to analyze

  • language (optional): Programming language

  • analysis_type (optional): Type of analysis (review, explain, optimize, debug)

3. generate_text

Generate text content in various styles and lengths.

Parameters:

  • prompt (required): The prompt for text generation

  • style (optional): Writing style (formal, casual, technical, creative)

  • length (optional): Desired length (short, medium, long)

4. translate_text

Translate text between languages.

Parameters:

  • text (required): Text to translate

  • target_language (required): Target language

  • source_language (optional): Source language (auto-detect if not provided)

5. summarize_content

Summarize content in different formats.

Parameters:

  • content (required): Content to summarize

  • summary_type (optional): Type of summary (brief, detailed, bullet_points)

  • max_length (optional): Maximum length in words (default: 150)

Available Resources

1. llm://model/info

Information about the loaded local LLM model.

2. llm://conversations/history

History of conversations with the LLM.

3. llm://system/status

Current system status and resource usage.
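
Resources are fetched with a resources/read request. A minimal sketch, assuming the URIs listed above (the response content depends on the loaded model):

    import json

    # Illustrative JSON-RPC request for the model-info resource listed above.
    request = {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "resources/read",
        "params": {"uri": "llm://model/info"},
    }
    print(json.dumps(request))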

Installation

  1. Clone or download this repository to your local machine.

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set up environment variables (create a .env file):

    MODEL_NAME=Qwen/Qwen3-8B
    MAX_NEW_TOKENS=512
    TEMPERATURE=0.7

  4. Ensure you have sufficient hardware:

    • GPU with at least 8GB VRAM (recommended)

    • 16GB+ system RAM

    • CUDA-compatible GPU (for optimal performance)

Usage

Running the Server

  1. Start the MCP server:

    python mcp_server.py

  2. The server will:

    • Load the specified model (Qwen3-8B by default)

    • Initialize the MCP server

    • Wait for JSON-RPC requests via stdin/stdout
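
You can smoke-test the stdio transport without any MCP client. The sketch below assumes newline-delimited JSON-RPC messages and the standard MCP initialize handshake; adjust the framing if your build differs:

    import json
    import subprocess

    # Start the server and talk to it over stdin/stdout.
    proc = subprocess.Popen(
        ["python", "mcp_server.py"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

    def send(msg):
        proc.stdin.write(json.dumps(msg) + "\n")
        proc.stdin.flush()

    # Standard MCP handshake, then ask for the tool list documented above.
    send({
        "jsonrpc": "2.0", "id": 1, "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "smoke-test", "version": "0.1"},
        },
    })
    print(proc.stdout.readline())   # initialize result
    send({"jsonrpc": "2.0", "method": "notifications/initialized"})
    send({"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
    print(proc.stdout.readline())   # should list the five tools above
    proc.terminate()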

Integrating with Claude Desktop

To use this server with Claude Desktop, add it to your MCP configuration:

  1. Create or edit your Claude Desktop MCP configuration file:

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

    • Windows: %APPDATA%\Claude\claude_desktop_config.json

  2. Add the server configuration:

    { "mcpServers": { "local-llm": { "command": "python", "args": ["/Users/sazzad/Documents/mcp-server/mcp_server.py"], "env": { "MODEL_NAME": "Qwen/Qwen3-8B" } } } }
  3. Restart Claude Desktop to load the new server.

Example Usage in Claude Desktop

Once configured, you can use the tools in Claude Desktop:

Please use the local LLM to analyze this Python code:

Claude will then use the analyze_code tool to get analysis from your local LLM.

Configuration

Environment Variables

  • MODEL_NAME: HuggingFace model name (default: "Qwen/Qwen3-8B")

  • MAX_NEW_TOKENS: Maximum tokens to generate (default: 512)

  • TEMPERATURE: Default temperature for generation (default: 0.7)
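
Inside the server these values are presumably read from the environment (or the .env file) along these lines; the defaults shown mirror the list above:

    import os

    # Configuration sketch; names match the environment variables above.
    MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen3-8B")
    MAX_NEW_TOKENS = int(os.getenv("MAX_NEW_TOKENS", "512"))
    TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))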

Supported Models

This server works with most HuggingFace transformers models that support:

  • Chat templates

  • 4-bit quantization via BitsAndBytesConfig (see the loading sketch below)

  • Causal language modeling

Popular choices:

  • Qwen/Qwen3-8B (default)

  • microsoft/DialoGPT-medium

  • meta-llama/Llama-2-7b-chat-hf

  • mistralai/Mistral-7B-Instruct-v0.1
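
For reference, the 4-bit loading path mentioned above usually looks like this with transformers and bitsandbytes; this is a generic sketch, not necessarily the server's exact code:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "Qwen/Qwen3-8B"  # or any of the models listed above

    # Standard 4-bit (NF4) quantization config from bitsandbytes.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )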

Hardware Requirements

Minimum Requirements

  • 8GB RAM

  • 4GB GPU VRAM (with 4-bit quantization)

  • 10GB disk space (for model storage)

Recommended Requirements

  • 16GB+ RAM

  • 8GB+ GPU VRAM

  • CUDA-compatible GPU

  • SSD storage for faster model loading

Troubleshooting

Common Issues

  1. Out of Memory Error:

    • Reduce MAX_NEW_TOKENS

    • Use a smaller model

    • Ensure 4-bit quantization is working

  2. Model Loading Fails:

    • Check internet connection for initial download

    • Verify model name is correct

    • Ensure sufficient disk space

  3. CUDA Errors:

    • Update CUDA drivers

    • Check GPU compatibility

    • Fall back to CPU if needed
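
A CPU fallback can be as simple as skipping quantization when no GPU is detected; a hedged sketch (bitsandbytes 4-bit loading requires a CUDA GPU):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model_name = "Qwen/Qwen3-8B"

    if torch.cuda.is_available():
        # GPU path: 4-bit quantization as described above.
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )
    else:
        # CPU path: load full-precision weights; slower, but avoids CUDA entirely.
        model = AutoModelForCausalLM.from_pretrained(model_name)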

Performance Optimization

  1. Use GPU acceleration:

    pip install torch --index-url https://download.pytorch.org/whl/cu118

  2. Enable Flash Attention (if compatible):

    pip install flash-attn

  3. Adjust batch size and sequence length based on your hardware.

Development

Project Structure

mcp-server/
├── mcp_server.py       # Main MCP server implementation
├── mcp-server.py       # Legacy FastAPI server (keep for reference)
├── requirements.txt    # Python dependencies
├── mcp/                # MCP protocol implementation
│   ├── __init__.py
│   ├── server.py       # MCP server base
│   └── types.py        # MCP protocol types
├── .env                # Environment variables (create this)
└── README.md           # This file

Contributing

  1. Fork the repository

  2. Create a feature branch

  3. Make your changes

  4. Test thoroughly

  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and questions:

  1. Check the troubleshooting section

  2. Review the configuration

  3. Open an issue on GitHub
