## Server Configuration

Describes the environment variables required to run the server.

| Name | Required | Description | Default |
|---|---|---|---|
| GEMINI_API_KEY | No | Your API key from aistudio.google.com (required for Gemini cloud backend) | |
| DELIA_JWT_SECRET | No | Your secure secret for JWT authentication | |
| DELIA_AUTH_ENABLED | No | Enable authentication for HTTP mode with multiple users | |
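
A minimal sketch of passing these variables to the server when it is launched over stdio by an MCP client, using the `mcp` Python SDK. The `delia` command and `serve` argument are placeholders, not a documented entry point; substitute whatever command your installation actually uses:

```python
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed entry point: adjust command/args to however Delia is started locally.
server_params = StdioServerParameters(
    command="delia",            # hypothetical console command
    args=["serve"],             # hypothetical subcommand
    env={
        # Only needed when the Gemini cloud backend is configured.
        "GEMINI_API_KEY": os.environ.get("GEMINI_API_KEY", ""),
        # Only relevant when authentication is enabled for multi-user HTTP mode.
        "DELIA_AUTH_ENABLED": "false",
        "DELIA_JWT_SECRET": os.environ.get("DELIA_JWT_SECRET", ""),
    },
)

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```
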
## Tools

Functions exposed to the LLM to take actions.

| Name | Description |
|---|---|
| delegate | Execute a task on a local or remote GPU with intelligent 3-tier model selection. Routes to the optimal backend based on content size, task type, and GPU availability. A client-side call sketch follows the table.<br>**Args:**<br>`task`: task type, which selects the model tier: "quick" or "summarize" → quick tier (fast 14B model); "generate", "review", "analyze" → coder tier (code-optimized 14B); "plan", "critique" → moe tier (deep reasoning, 30B+).<br>`content`: the prompt or content to process (required).<br>`file`: optional file path to include in context.<br>`model`: force a specific tier ("quick" \| "coder" \| "moe" \| "thinking") or use natural language ("7b", "14b", "30b", "small", "large", "coder model", "fast", "complex", "thinking").<br>`language`: language hint for better prompts (python\|typescript\|react\|nextjs\|rust\|go).<br>`context`: Serena memory names to include (comma-separated: "architecture,decisions").<br>`symbols`: code symbols to focus on (comma-separated: "Foo,Bar/calculate").<br>`include_references`: true if content includes symbol usages from elsewhere.<br>`backend_type`: force the backend type ("local" \| "remote"; default: auto-select).<br>**Returns:** LLM response with a metadata footer showing model, tokens, time, and backend.<br>**Examples:** `delegate(task="review", content="", language="python")`; `delegate(task="generate", content="Write a REST API", backend_type="local")`; `delegate(task="plan", content="Design caching strategy", model="moe")`; `delegate(task="analyze", content="Debug this error", model="14b")`; `delegate(task="quick", content="Summarize this article", model="fast")` |
| think | Deep reasoning for complex problems using a local GPU with extended thinking. Offloads complex analysis to a local LLM at zero API cost.<br>**Args:**<br>`problem`: the problem or question to think through (required).<br>`context`: supporting information such as code, docs, or constraints (optional).<br>`depth`: reasoning depth level: "quick" → fast answer, no extended thinking (14B model); "normal" → balanced reasoning with thinking (14B coder); "deep" → thorough multi-step analysis (30B+ MoE model).<br>**Returns:** structured analysis with step-by-step reasoning and conclusions.<br>**Examples:** `think(problem="How should we handle authentication?", depth="deep")`; `think(problem="Debug this error", context="", depth="normal")` |
| batch | Execute multiple tasks in parallel across all available GPUs for maximum throughput. Distributes work intelligently across local and remote backends. A payload-building sketch follows the table.<br>**Args:**<br>`tasks`: JSON string containing an array of task objects. Each object can have `task` ("quick" \| "summarize" \| "generate" \| "review" \| "analyze" \| "plan" \| "critique"), `content` (the content to process, required), `file` (optional file path), `model` (force tier: "quick" \| "coder" \| "moe"), and `language` (language hint for code tasks).<br>**Returns:** combined results from all tasks with timing and routing info.<br>**Example:** `batch('[{"task": "summarize", "content": "doc1..."}, {"task": "review", "content": "code2...", "language": "python"}, {"task": "analyze", "content": "log3..."}]')` |
| health | Check the health status of Delia and all configured GPU backends. Only checks backends that are enabled in settings.json. Shows availability, loaded models, usage stats, and cost savings. A monitoring sketch follows the table.<br>**Returns:** JSON with `status` ("healthy" \| "degraded" \| "unhealthy"), `backends` (array of configured backend statuses), `usage` (token counts and call statistics per tier), `cost_savings` (estimated savings vs. cloud API), and `routing` (current routing configuration). |
| queue_status | Get the current status of the model queue system. Shows loaded models, queued requests, and GPU memory usage. Useful for monitoring queue performance and debugging loading issues.<br>**Returns:** JSON with queue status, loaded models, and pending requests. |
| models | List all configured models across all GPU backends. Shows model tiers (quick/coder/moe) and which are currently loaded.<br>**Returns:** JSON with `backends` (all configured backends with their models), `currently_loaded` (models already in GPU memory, so no load time), and `selection_logic` (how tasks map to model tiers). |
| switch_backend | Switch the active LLM backend.<br>**Args:**<br>`backend_id`: ID of the backend to switch to (from settings.json).<br>**Returns:** confirmation message with current status. |
| switch_model | Switch the model for a specific tier at runtime. Allows dynamic model experimentation without restarting the server; changes are persisted to settings.json for consistency across restarts.<br>**Args:**<br>`tier`: model tier to change ("quick", "coder", "moe", or "thinking").<br>`model_name`: new model name (must be available in the current backend).<br>**Returns:** confirmation with model change details and availability status. |
| get_model_info_tool | Get detailed information about a specific model: VRAM requirements, context window size, and tier classification. Shows exact values for configured models and estimates for unknown models.<br>**Args:**<br>`model_name`: name of the model to get info for (e.g., "qwen2.5:14b", "llama3.1:70b").<br>**Returns:** formatted model information including VRAM, context, and tier. |
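
The call sketches below are illustrative only: they reuse the `session` object from the launch sketch in the Server Configuration section, and the prompt strings mirror the examples in the table rather than any required format. First, routing a code review to the coder tier with `delegate` and a deep analysis to the moe tier with `think`:

```python
# Illustrative sketch; "session" is the ClientSession from the launch sketch above.
async def review_and_plan(session) -> None:
    # delegate routes this to the coder tier because task="review".
    review = await session.call_tool(
        "delegate",
        arguments={
            "task": "review",
            "content": "def add(a, b):\n    return a + b",  # hypothetical snippet
            "language": "python",
        },
    )
    print(review.content)

    # think with depth="deep" escalates to the 30B+ MoE model for multi-step reasoning.
    plan = await session.call_tool(
        "think",
        arguments={"problem": "How should we handle authentication?", "depth": "deep"},
    )
    print(plan.content)
```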
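
Because `batch` takes its `tasks` argument as a JSON string rather than a list, one reasonable pattern is to build the task objects as Python dicts and serialize them with `json.dumps`; a sketch under the same session assumption:

```python
import json

async def fan_out(session) -> None:
    # batch expects a JSON *string* encoding an array of task objects.
    tasks = json.dumps([
        {"task": "summarize", "content": "doc1..."},
        {"task": "review", "content": "code2...", "language": "python"},
        {"task": "analyze", "content": "log3..."},
    ])
    result = await session.call_tool("batch", arguments={"tasks": tasks})
    print(result.content)  # combined results with timing and routing info
```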
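
Finally, a monitoring sketch that reads the JSON returned by `health` and `models`. The field names follow the Returns descriptions above, but the assumption that the result arrives as a single text content block should be checked against your server version:

```python
import json

async def check_backends(session) -> None:
    # health only probes backends that are enabled in settings.json.
    health = await session.call_tool("health", arguments={})
    report = json.loads(health.content[0].text)  # assumes a single text block
    print(report["status"])          # "healthy" | "degraded" | "unhealthy"
    print(report.get("backends"))    # per-backend availability and loaded models

    # models lists every configured tier and which models are already loaded.
    models = await session.call_tool("models", arguments={})
    print(models.content[0].text)
```
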
## Prompts

Interactive templates invoked by user choice.

| Name | Description |
|---|---|
| No prompts | |

## Resources

Contextual data attached and managed by the client.

| Name | Description |
|---|---|
| No resources | |