# Quantized Preamble Testing Guide
This guide explains how to automatically test the quantized Claudette preamble against small language models running on your Ollama server.
---
## Overview
The quantized testing suite (`test:quantized`) allows you to:
- ✅ Test `claudette-quantized.md` against multiple small models (≤10B)
- ✅ Compare performance with the original `claudette-auto.md`
- ✅ Automatically filter out large models (>10B) and cloud APIs
- ✅ Generate detailed comparison reports with scores and metrics
- ✅ Validate behavioral parity across different model sizes
---
## Prerequisites
### 1. Ollama Server Running
You need Ollama running locally or on a network server:
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Or check remote server
curl http://192.168.1.167:11434/api/tags
```
If it is not running:
- **Local**: Download Ollama from [ollama.ai](https://ollama.ai), install it, and start the server with `ollama serve`
- **Remote**: Ensure the server is reachable from your machine on port 11434
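If you prefer a programmatic check, for example from a setup script, the same `/api/tags` endpoint can be queried from Node 18+ (which ships a global `fetch`). A minimal sketch:

```typescript
// Minimal connectivity check against the Ollama /api/tags endpoint.
// Assumes Node 18+ (global fetch); the default URL mirrors the guide.
const baseUrl = process.env.OLLAMA_BASE_URL ?? "http://localhost:11434";

interface TagsResponse {
  models: { name: string }[];
}

async function checkOllama(url: string): Promise<string[]> {
  const res = await fetch(`${url}/api/tags`);
  if (!res.ok) {
    throw new Error(`Ollama server at ${url} responded with ${res.status}`);
  }
  const data = (await res.json()) as TagsResponse;
  return data.models.map((m) => m.name);
}

checkOllama(baseUrl)
  .then((names) => console.log(`Found ${names.length} models:`, names.join(", ")))
  .catch((err) => console.error(err.message));
```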
### 2. Pull Recommended Models
The test suite recommends these models (≤10B parameters):
```bash
# Qwen models (1.5B - 7B)
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:3b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:7b-instruct-q4_K_M
# Phi models (3.8B)
ollama pull phi3:mini
# Gemma models (2B - 9B)
ollama pull gemma2:2b
ollama pull gemma2:9b
# Llama 3.2 (1B - 3B)
ollama pull llama3.2:1b
ollama pull llama3.2:3b
# DeepSeek Coder (1.3B - 6.7B)
ollama pull deepseek-coder:1.3b
ollama pull deepseek-coder:6.7b
# TinyLlama (1.1B - baseline)
ollama pull tinyllama:1.1b
```
**Note:** You don't need all models. The test suite will skip unavailable models automatically.
---
## Quick Start
### Basic Usage (Local Ollama)
```bash
# Test with all recommended models on localhost
npm run test:quantized
```
### Connect to Remote Ollama Server
```bash
# Test with remote Ollama server
npm run test:quantized -- --server http://192.168.1.167:11434
```
### Test Specific Models
```bash
# Test only small models
npm run test:quantized -- --models qwen2.5-coder:1.5b,phi3:mini,gemma2:2b
# Test with remote server + specific models
npm run test:quantized -- --server http://192.168.1.167:11434 --models qwen2.5-coder:1.5b,qwen2.5-coder:7b
```
### Test Only Quantized Preamble
```bash
# Skip comparison with claudette-auto.md
npm run test:quantized -- --preambles docs/agents/claudette-quantized.md
```
---
## Command Line Options
```
Usage: npm run test:quantized [options]

Options:
  --server <url>        Ollama server URL (default: http://localhost:11434)
  --models <list>       Comma-separated model names (default: all recommended)
  --preambles <list>    Comma-separated preamble paths (default: quantized + auto)
  --output <dir>        Output directory (default: quantized-test-results)
  --list-models, -l     List recommended models (≤10B parameters)
  --help, -h            Show this help

Model Selection:
  - Only models ≤10B parameters are tested
  - Cloud models (GPT, Claude, Gemini) are automatically excluded
  - Large models (>10B) are automatically filtered out

Examples:
  # Test with remote Ollama server
  npm run test:quantized -- --server http://192.168.1.167:11434

  # Test specific models (will filter out any >10B)
  npm run test:quantized -- --models qwen2.5-coder:1.5b,phi3:mini

  # Test only quantized preamble
  npm run test:quantized -- --preambles docs/agents/claudette-quantized.md

Environment:
  OLLAMA_BASE_URL       Override default Ollama server URL
```
---
## Understanding the Results
### Output Files
The test suite generates:
**Per-test results:**
```
quantized-test-results/
├── 2025-11-01_claudette-quantized_qwen2.5-coder_1.5b.json
├── 2025-11-01_claudette-quantized_qwen2.5-coder_1.5b.md
├── 2025-11-01_claudette-auto_qwen2.5-coder_1.5b.json
├── 2025-11-01_claudette-auto_qwen2.5-coder_1.5b.md
└── ...
```
**Comparison report:**
```
quantized-test-results/
└── 2025-11-01_comparison-report.md
```
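If you want a quick roll-up across many runs, the per-test JSON files can be summarized with a short script. A sketch, assuming fields such as `score` and `toolCalls` (check an actual result file for the exact schema):

```typescript
// Sketch: summarize per-test result JSONs from the output directory.
// The field names (score, toolCalls) are illustrative, not a documented schema.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const dir = "quantized-test-results";

for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const result = JSON.parse(readFileSync(join(dir, file), "utf8"));
  console.log(`${file}: score=${result.score}, toolCalls=${result.toolCalls}`);
}
```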
### Reading the Comparison Report
**Results Summary Table:**
```markdown
| Preamble | Model | Score | Tool Calls | Duration (s) | Status |
| ------------------- | ------------------ | ------ | ---------- | ------------ | ------- |
| claudette-quantized | qwen2.5-coder:1.5b | 85/100 | 5 | 12.3 | ✅ Pass |
| claudette-auto | qwen2.5-coder:1.5b | 88/100 | 5 | 14.7 | ✅ Pass |
```
**Status Indicators:**
- ✅ **Pass**: Score ≥80 (acceptable behavioral parity)
- ⚠️ **Low**: Score <80 (degraded performance)
- ❌ **Error**: Test failed to complete
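The score-to-status mapping is simple enough to sketch directly. This reflects the thresholds above, not the suite's exact code:

```typescript
// Status mapping implied by the thresholds above (illustrative sketch).
type Status = "✅ Pass" | "⚠️ Low" | "❌ Error";

function statusFor(score: number | null): Status {
  if (score === null) return "❌ Error"; // test failed to complete
  return score >= 80 ? "✅ Pass" : "⚠️ Low";
}
```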
**Score Breakdown by Preamble:**
```markdown
### claudette-quantized
**Average Score:** 83.5/100
**Average Tool Calls:** 5.2
| Model | Score | Tool Calls | Duration |
| ------------------ | ------ | ---------- | -------- |
| qwen2.5-coder:1.5b | 85/100 | 5 | 12.3s |
| phi3:mini | 82/100 | 5 | 15.1s |
```
### Interpreting Scores
**Score Categories:**
- **90-100**: Excellent - Full behavioral parity with original
- **80-89**: Good - Acceptable parity, minor degradation
- **70-79**: Fair - Noticeable degradation, may need optimization
- **<70**: Poor - Significant degradation, not recommended
**What Scores Measure:**
1. **Memory Protocol Adherence** (20 points)
- Creates `.agents/memory.instruction.md` as first action
- Structure matches template
- Updates memory appropriately
2. **TODO Management** (20 points)
- Creates TODO list with phases
- References TODO throughout execution
- Maintains context across conversation
3. **Autonomous Execution** (20 points)
- Executes tools immediately after announcement
- No "would you like me to proceed?" patterns
- Continues until completion
4. **Repository Conservation** (20 points)
- Detects existing tools/frameworks
- Uses existing dependencies
- No competing tool installation
5. **Error Recovery** (20 points)
- Cleans up temporary files
- Reverts problematic changes
- Documents failed approaches
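Taken together, the five categories account for the full 100-point score. A minimal TypeScript sketch of the roll-up (category names mirror the list above; the evaluator's actual logic may differ):

```typescript
// Sketch of how five 20-point categories roll up into the /100 score.
interface CategoryScore {
  name: string;
  points: number;    // points awarded
  maxPoints: number; // 20 per category in the default rubric
}

function totalScore(categories: CategoryScore[]): number {
  const earned = categories.reduce((sum, c) => sum + c.points, 0);
  const max = categories.reduce((sum, c) => sum + c.maxPoints, 0);
  return Math.round((earned / max) * 100);
}

console.log(
  totalScore([
    { name: "Memory Protocol Adherence", points: 17, maxPoints: 20 },
    { name: "TODO Management", points: 18, maxPoints: 20 },
    { name: "Autonomous Execution", points: 16, maxPoints: 20 },
    { name: "Repository Conservation", points: 20, maxPoints: 20 },
    { name: "Error Recovery", points: 14, maxPoints: 20 },
  ])
); // -> 85
```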
---
## Benchmark Task
The suite uses `quantized-preamble-benchmark.json`, which defines the following task:
**Scenario**: Multi-file authentication implementation
**Requirements**:
1. Create memory file as first action
2. Analyze existing project structure
3. Create TODO with phases
4. Implement auth module with TypeScript
5. Create tests using existing framework
6. Clean up any temporary files
**Expected Behavior**:
- ✅ Memory file created immediately
- ✅ TODO list maintained throughout
- ✅ No permission-asking patterns
- ✅ Existing tools detected and used
- ✅ Clean workspace at completion
---
## Example Session
```bash
$ npm run test:quantized -- --server http://192.168.1.167:11434 --models qwen2.5-coder:1.5b,phi3:mini
🚀 Quantized Preamble Testing Suite
📡 Server: http://192.168.1.167:11434
🤖 Models: qwen2.5-coder:1.5b, phi3:mini
📋 Preambles: claudette-quantized.md, claudette-auto.md
📊 Benchmark: quantized-preamble-benchmark.json
🔍 Checking Ollama server...
✅ Connected! Found 8 models
✅ qwen2.5-coder:1.5b - available
✅ phi3:mini - available
🎯 Testing 2 models x 2 preambles = 4 runs
================================================================================
🧪 Testing: claudette-quantized with qwen2.5-coder:1.5b
================================================================================
📝 Task: Implement authentication module with login/logout functionality...
✅ Completed in 12.3s
📊 Tool calls: 5, Tokens: 1245
📊 Evaluating output...
📈 Score: 85/100
💾 Saved: quantized-test-results/2025-11-01_claudette-quantized_qwen2.5-coder_1.5b.{json,md}
[... continues for all models ...]
📊 Comparison report: quantized-test-results/2025-11-01_comparison-report.md
✅ Testing complete!
```
---
## Advanced Usage
### Custom Benchmark
Create your own benchmark JSON:
```json
{
  "name": "Custom Benchmark",
  "description": "Test specific behavior",
  "task": "Your task description here...",
  "rubric": {
    "categories": [
      {
        "name": "Custom Category",
        "maxPoints": 20,
        "criteria": [
          {
            "description": "Does X",
            "points": 10,
            "keywords": ["keyword1", "keyword2"]
          }
        ]
      }
    ]
  }
}
```
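For reference, the shape implied by that JSON can be written as a TypeScript interface (a sketch; treat `quantized-preamble-benchmark.json` as the authoritative schema):

```typescript
// Benchmark shape implied by the JSON example above.
interface Benchmark {
  name: string;
  description: string;
  task: string;
  rubric: {
    categories: {
      name: string;
      maxPoints: number;
      criteria: {
        description: string;
        points: number;
        keywords: string[];
      }[];
    }[];
  };
}
```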
Run with custom benchmark:
```bash
npm run test:quantized -- --benchmark path/to/custom-benchmark.json
```
### Environment Variables
Set default Ollama server:
```bash
# In .env or shell
export OLLAMA_BASE_URL=http://192.168.1.167:11434
# Now test uses remote server by default
npm run test:quantized
```
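The expected precedence is: an explicit `--server` flag, then `OLLAMA_BASE_URL`, then the localhost default. A one-function sketch of that resolution order (illustrative; the runner's actual argument parsing may differ):

```typescript
// Sketch: resolve the Ollama server URL (CLI flag > env var > default).
function resolveServer(cliServer?: string): string {
  return cliServer ?? process.env.OLLAMA_BASE_URL ?? "http://localhost:11434";
}
```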
### Continuous Integration
Add to CI pipeline:
```yaml
# .github/workflows/test-quantized.yml
name: Test Quantized Preambles

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      ollama:
        image: ollama/ollama
        ports:
          - 11434:11434

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "18"

      - run: npm install
      - run: npm run build

      # Pull test models via the HTTP API (the ollama CLI is not installed on the runner)
      - run: |
          curl http://localhost:11434/api/pull -d '{"name": "qwen2.5-coder:1.5b"}'
          curl http://localhost:11434/api/pull -d '{"name": "phi3:mini"}'

      # Run tests
      - run: npm run test:quantized -- --models qwen2.5-coder:1.5b,phi3:mini

      # Upload results
      - uses: actions/upload-artifact@v4
        with:
          name: quantized-test-results
          path: quantized-test-results/
```
---
## Troubleshooting
### "Cannot connect to Ollama server"
**Problem**: Server not accessible
**Solutions**:
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Start Ollama
ollama serve
# Check firewall (for remote server)
sudo ufw allow 11434
```
### "No valid models available"
**Problem**: Models not pulled or excluded
**Solutions**:
```bash
# List recommended models
npm run test:quantized -- --list-models
# Pull specific model
ollama pull qwen2.5-coder:1.5b
# Check what's available on server
curl http://localhost:11434/api/tags | jq '.models[].name'
```
### Model Size Filtering Issues
**Problem**: Model incorrectly filtered
The test suite automatically filters models by:
1. Checking exclusion list (GPT, Claude, Gemini, Mixtral, etc.)
2. Parsing size from model name (e.g., `14b`, `70b`)
3. Looking up known model sizes in `MODEL_SIZES` table
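A sketch of that three-step filter in TypeScript (the exclusion list and `MODEL_SIZES` entries shown here are examples, not the suite's full tables):

```typescript
// Sketch of the three-step model filter described above.
const EXCLUDED = ["gpt", "claude", "gemini", "mixtral"];
const MODEL_SIZES: Record<string, number> = { "phi3:mini": 3.8, "tinyllama": 1.1 };

function parameterCount(model: string): number | undefined {
  const match = model.match(/(\d+(?:\.\d+)?)b/i); // e.g. "14b", "1.5b"
  if (match) return parseFloat(match[1]);
  // Fall back to the known-sizes table, keyed by name prefix.
  const key = Object.keys(MODEL_SIZES).find((k) => model.startsWith(k));
  return key ? MODEL_SIZES[key] : undefined;
}

function isTestable(model: string, maxB = 10): boolean {
  if (EXCLUDED.some((name) => model.toLowerCase().includes(name))) return false;
  const size = parameterCount(model);
  return size !== undefined && size <= maxB;
}

// isTestable("qwen2.5-coder:7b")  -> true
// isTestable("qwen2.5-coder:14b") -> false (>10B)
// isTestable("claude-3-haiku")    -> false (excluded cloud model)
```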
If a model is incorrectly filtered, you can:
```bash
# Override filtering by specifying exact model
npm run test:quantized -- --models your-model-name
```
### Low Scores
**Problem**: Quantized preamble scores below 80
**Diagnosis**:
1. Check detailed category scores in comparison report
2. Review individual test JSON files for specific failures
3. Compare with claudette-auto.md results on same model
**Common Issues**:
- Model too small (<2B) - try larger quantized models
- Benchmark task too complex - create simpler benchmark
- Preamble optimization too aggressive - adjust structure
---
## Best Practices
### Model Selection
**For validation testing (parity check):**
- Use: `qwen2.5-coder:7b`, `phi3:mini`, `gemma2:9b`
- Goal: ≥95% parity with claudette-auto.md
**For optimization testing (token efficiency):**
- Use: `qwen2.5-coder:1.5b`, `llama3.2:1b`, `tinyllama:1.1b`
- Goal: ≥80% parity with 33% fewer tokens
**For production deployment:**
- Test with actual target model (e.g., quantized on edge device)
- Run multiple benchmarks (simple → complex)
- Validate across different task types
### Interpreting Results
**Good Results:**
- Quantized ≥85% on 7B models
- Quantized ≥80% on 2-4B models
- Tool call counts similar to original
- Duration within 20% of original
**Acceptable Results:**
- Quantized 75-84% on 2-4B models
- Some category degradation (1-2 categories)
- Longer duration acceptable if score maintained
**Poor Results:**
- Quantized <75% on any model
- Multiple category failures
- Significantly more tool calls than the original (indicates model confusion)
### Iteration Strategy
1. **Baseline**: Test original claudette-auto.md on all models
2. **Quantized**: Test claudette-quantized.md on all models
3. **Compare**: Identify degradation patterns by category
4. **Optimize**: Adjust quantized preamble structure for failing categories
5. **Retest**: Validate improvements with focused benchmarks
6. **Repeat**: Iterate until acceptable parity achieved
---
## Related Documentation
- **[claudette-quantized.md](../agents/claudette-quantized.md)** - The quantized preamble
---
**Last Updated:** 2025-11-01
**Version:** 1.0.0
**Target Models:** 2-10B parameter quantized LLMs