Server Configuration

Describes the environment variables required to run the server.

GEMINI_API_KEY (optional): Your API key from aistudio.google.com (required for the Gemini cloud backend)
DELIA_JWT_SECRET (optional): Your secure secret for JWT authentication
DELIA_AUTH_ENABLED (optional): Enable authentication for HTTP mode with multiple users
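
A minimal sketch of supplying these variables when launching the server from Python is shown below; the launch command ("python -m delia") and the example values are assumptions, not taken from this page.

  # Hypothetical launch sketch; "python -m delia" is an assumed entry point.
  import os
  import subprocess

  env = dict(os.environ)
  env["DELIA_AUTH_ENABLED"] = "true"     # enable authentication for multi-user HTTP mode
  env["DELIA_JWT_SECRET"] = "change-me"  # secret used for JWT authentication
  env["GEMINI_API_KEY"] = "your-key"     # only needed for the Gemini cloud backend
  subprocess.run(["python", "-m", "delia"], env=env)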

Tools

Functions exposed to the LLM to take actions

delegate

Execute a task on local/remote GPU with intelligent 3-tier model selection. Routes to optimal backend based on content size, task type, and GPU availability.

WHEN TO USE:

  • "locally", "on my GPU", "without API", "privately" → Use this tool

  • Code review, generation, analysis tasks → Use this tool

  • Any task you want processed on local hardware → Use this tool

Args:

  • task: Task type determines model tier:
      - "quick" or "summarize" → quick tier (fast, 14B model)
      - "generate", "review", "analyze" → coder tier (code-optimized 14B)
      - "plan", "critique" → moe tier (deep reasoning 30B+)
  • content: The prompt or content to process (required)
  • file: Optional file path to include in context
  • model: Force specific tier - "quick" | "coder" | "moe" | "thinking" OR natural language: "7b", "14b", "30b", "small", "large", "coder model", "fast", "complex", "thinking"
  • language: Language hint for better prompts - python|typescript|react|nextjs|rust|go
  • context: Serena memory names to include (comma-separated: "architecture,decisions")
  • symbols: Code symbols to focus on (comma-separated: "Foo,Bar/calculate")
  • include_references: True if content includes symbol usages from elsewhere
  • backend_type: Force backend type - "local" | "remote" (default: auto-select)

ROUTING LOGIC:

  1. Content > 32K tokens → Uses backend with largest context window

  2. Prefers local GPUs (lower latency) unless unavailable

  3. Falls back to remote if local circuit breaker is open

  4. Load balances across available backends based on priority weights

Returns: LLM response with a metadata footer showing model, tokens, time, and backend

Examples:

  delegate(task="review", content="", language="python")
  delegate(task="generate", content="Write a REST API", backend_type="local")
  delegate(task="plan", content="Design caching strategy", model="moe")
  delegate(task="analyze", content="Debug this error", model="14b")
  delegate(task="quick", content="Summarize this article", model="fast")
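
The following is an illustrative sketch of the tier selection and routing rules described above; it is not Delia's actual implementation, just the documented behaviour expressed as code.

  # Simplified sketch of delegate's documented routing, not the real source.
  def select_tier(task: str, forced: str | None = None) -> str:
      if forced:
          return forced                      # an explicit model/tier override wins
      if task in ("quick", "summarize"):
          return "quick"                     # fast 14B model
      if task in ("generate", "review", "analyze"):
          return "coder"                     # code-optimized 14B
      return "moe"                           # "plan"/"critique": deep reasoning 30B+

  def select_backend(content_tokens: int, local_healthy: bool) -> str:
      if content_tokens > 32_000:
          return "largest_context"           # backend with the biggest context window
      if local_healthy:
          return "local"                     # prefer local GPUs for lower latency
      return "remote"                        # fall back when the local circuit breaker is open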

think

Deep reasoning for complex problems using a local GPU with extended thinking. Offloads complex analysis to a local LLM at zero API cost.

WHEN TO USE:

  • Complex multi-step problems requiring careful reasoning

  • Architecture decisions, trade-off analysis

  • Debugging strategies, refactoring plans

  • Any situation requiring "thinking through" before acting

Args:

  • problem: The problem or question to think through (required)
  • context: Supporting information - code, docs, constraints (optional)
  • depth: Reasoning depth level:
      - "quick" → Fast answer, no extended thinking (14B model)
      - "normal" → Balanced reasoning with thinking (14B coder)
      - "deep" → Thorough multi-step analysis (30B+ MoE model)

ROUTING:

  • Uses largest available GPU for deep thinking

  • Automatically enables thinking mode for normal/deep

  • Prefers local GPU, falls back to remote if needed

Returns: Structured analysis with step-by-step reasoning and conclusions

Examples:

  think(problem="How should we handle authentication?", depth="deep")
  think(problem="Debug this error", context="", depth="normal")
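
Below is a small sketch of the documented depth-to-tier mapping; the dictionary is illustrative, not Delia's code.

  # Sketch of think()'s documented depth handling.
  def resolve_depth(depth: str) -> dict:
      mapping = {
          "quick":  {"tier": "quick", "thinking": False},  # fast answer, 14B model
          "normal": {"tier": "coder", "thinking": True},   # balanced reasoning, 14B coder
          "deep":   {"tier": "moe",   "thinking": True},   # thorough analysis, 30B+ MoE
      }
      return mapping[depth]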

batch

Execute multiple tasks in PARALLEL across all available GPUs for maximum throughput. Distributes work across local and remote backends intelligently.

WHEN TO USE:

  • Processing multiple files/documents simultaneously

  • Bulk code review, summarization, or analysis

  • Any workload that can be parallelized

Args:

  • tasks: JSON string containing an array of task objects. Each object can have:
      - task: "quick" | "summarize" | "generate" | "review" | "analyze" | "plan" | "critique"
      - content: The content to process (required)
      - file: Optional file path
      - model: Force tier - "quick" | "coder" | "moe"
      - language: Language hint for code tasks

ROUTING LOGIC:

  • Distributes tasks across ALL available GPUs (local + remote)

  • Large content (>32K tokens) → Routes to backend with sufficient context

  • Normal content → Round-robin for parallel execution

  • Respects backend health and circuit breakers

Returns: Combined results from all tasks with timing and routing info

Example:

  batch('[
    {"task": "summarize", "content": "doc1..."},
    {"task": "review", "content": "code2...", "language": "python"},
    {"task": "analyze", "content": "log3..."}
  ]')
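
Because tasks is a JSON string, building it with json.dumps avoids quoting mistakes; the sketch below reuses the placeholder contents from the example above.

  import json

  tasks = json.dumps([
      {"task": "summarize", "content": "doc1..."},
      {"task": "review", "content": "code2...", "language": "python"},
      {"task": "analyze", "content": "log3..."},
  ])
  # Pass the resulting string as the tool's `tasks` argument, e.g. batch(tasks).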

health

Check health status of Delia and all configured GPU backends.

Only checks backends that are enabled in settings.json. Shows availability, loaded models, usage stats, and cost savings.

WHEN TO USE:

  • Verify backends are available before delegating

  • Check which models are currently loaded

  • Monitor usage statistics and cost savings

  • Diagnose connection issues

Returns: JSON with:

  • status: "healthy" | "degraded" | "unhealthy"
  • backends: Array of configured backend status
  • usage: Token counts and call statistics per tier
  • cost_savings: Estimated savings vs cloud API
  • routing: Current routing configuration
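
A minimal sketch of consuming the health report follows; only the top-level keys listed above are documented, so the per-backend "available" field is an assumption.

  import json

  report = json.loads(health_json)   # health_json: the JSON string returned by the health tool
  if report["status"] != "healthy":
      # "available" is a hypothetical per-backend field, shown only for illustration
      down = [b for b in report["backends"] if not b.get("available", True)]
      print("Backends needing attention:", down)
  print("Estimated cost savings:", report["cost_savings"])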

queue_status

Get current status of the model queue system.

Shows loaded models, queued requests, and GPU memory usage. Useful for monitoring queue performance and debugging loading issues.

Returns: JSON with queue status, loaded models, and pending requests
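
Example call (the tool takes no documented arguments):

  queue_status()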

models

List all configured models across all GPU backends. Shows model tiers (quick/coder/moe) and which are currently loaded.

WHEN TO USE:

  • Check which models are available for tasks

  • Verify model configuration across backends

  • Understand task-to-model routing logic

Returns: JSON with:

  • backends: All configured backends with their models
  • currently_loaded: Models in GPU memory (no load time)
  • selection_logic: How tasks map to model tiers
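
A hedged sketch of reading the returned JSON; only the top-level keys listed above are assumed to exist.

  import json

  info = json.loads(models_json)     # models_json: the JSON string returned by the models tool
  print("Loaded in GPU memory:", info["currently_loaded"])
  print("Configured backends:", info["backends"])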

switch_backend

Switch the active LLM backend.

Args: backend_id: ID of the backend to switch to (from settings.json)

Returns: Confirmation message with current status
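
Example (the backend ID is a placeholder; valid IDs come from settings.json):

  switch_backend(backend_id="local-ollama")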

switch_model

Switch the model for a specific tier at runtime.

This allows dynamic model experimentation without restarting the server. Changes are persisted to settings.json for consistency across restarts.

Args:

  • tier: Model tier to change - "quick", "coder", "moe", or "thinking"
  • model_name: New model name (must be available in the current backend)

Returns: Confirmation with model change details and availability status
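
Example (the model name must be available in the current backend; "qwen2.5:14b" is borrowed from the get_model_info_tool example below):

  switch_model(tier="coder", model_name="qwen2.5:14b")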

get_model_info_tool

Get detailed information about a specific model.

Returns VRAM requirements, context window size, and tier classification. For configured models, shows exact values. For unknown models, provides estimates.

Args: model_name: Name of the model to get info for (e.g., "qwen2.5:14b", "llama3.1:70b")

Returns: Formatted model information including VRAM, context, and tier
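
Example, using one of the model names mentioned above:

  get_model_info_tool(model_name="qwen2.5:14b")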

Prompts

Interactive templates invoked by user choice


No prompts

Resources

Contextual data attached and managed by the client


No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/zbrdc/delia'
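
The same request from Python, as a sketch assuming the requests package is installed and the endpoint returns JSON:

  import requests

  resp = requests.get("https://glama.ai/api/mcp/v1/servers/zbrdc/delia")
  resp.raise_for_status()
  print(resp.json())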

If you have feedback or need assistance with the MCP directory API, please join our Discord server.