M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository


search-evaluation.md•7.77 KiB

# NornicDB Eval Harness

**Search Quality Evaluation & Validation**

## Overview

The eval harness provides automated testing and validation of NornicDB's search quality. It computes standard Information Retrieval (IR) metrics and reports pass/fail based on configurable thresholds.

## Quick Start

```bash
# Run against running server with built-in tests
cd nornicdb
go run ./cmd/eval

# Run with custom test suite
go run ./cmd/eval -suite path/to/tests.json

# Output JSON for CI/CD
go run ./cmd/eval -output json -save results.json
```

## Metrics Computed

| Metric | Description | Range |
|--------|-------------|-------|
| **Precision@K** | Fraction of top-K results that are relevant | 0-1 |
| **Recall@K** | Fraction of all relevant docs in top-K | 0-1 |
| **MRR** | Mean Reciprocal Rank - where first relevant result appears | 0-1 |
| **NDCG@K** | Normalized Discounted Cumulative Gain - ranking quality | 0-1 |
| **MAP** | Mean Average Precision | 0-1 |
| **Hit Rate** | Fraction of queries with at least one relevant result | 0-1 |

### ELI12 (Explain Like I'm 12)

Think of it like grading a spelling bee:

- **Precision**: "How many of your first 10 guesses were correct?"
- **Recall**: "Of all the correct answers, how many did you find?"
- **MRR**: "How quickly did you get your first right answer?"
- **Hit Rate**: "Did you get at least one right?"
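To make the metric definitions above concrete, here is a minimal sketch of how they could be computed for a single query. The function names (`precisionAtK`, `recallAtK`, `reciprocalRank`, `ndcgAtK`) are hypothetical and are not the `pkg/eval` API. The sketch assumes result lists contain no duplicate IDs, divides precision by K rather than by the number of returned results (which appears consistent with the sample output in the Detailed section, where one hit gives P@10 of 0.10), and uses a linear-gain, log2-discount form of NDCG. The harness itself may use a different NDCG variant.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// toSet turns a list of expected IDs into a lookup set.
func toSet(ids []string) map[string]bool {
	set := make(map[string]bool, len(ids))
	for _, id := range ids {
		set[id] = true
	}
	return set
}

// countHits counts how many of the first k results are expected IDs.
func countHits(results, expected []string, k int) int {
	want := toSet(expected)
	hits := 0
	for i, id := range results {
		if i >= k {
			break
		}
		if want[id] {
			hits++
		}
	}
	return hits
}

// precisionAtK divides by k itself (assumption), not by len(results).
func precisionAtK(results, expected []string, k int) float64 {
	if k <= 0 {
		return 0
	}
	return float64(countHits(results, expected, k)) / float64(k)
}

// recallAtK is the fraction of expected IDs found in the top k.
func recallAtK(results, expected []string, k int) float64 {
	if len(expected) == 0 {
		return 0
	}
	return float64(countHits(results, expected, k)) / float64(len(expected))
}

// reciprocalRank is 1/rank of the first expected ID, or 0 if none appears.
func reciprocalRank(results, expected []string) float64 {
	want := toSet(expected)
	for i, id := range results {
		if want[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// ndcgAtK uses linear gain (the grade) and a log2(rank+1) discount (assumption).
func ndcgAtK(results []string, grades map[string]int, k int) float64 {
	dcg := 0.0
	for i, id := range results {
		if i >= k {
			break
		}
		dcg += float64(grades[id]) / math.Log2(float64(i+2))
	}
	// The ideal ordering lists all graded docs from highest to lowest grade.
	ideal := make([]int, 0, len(grades))
	for _, g := range grades {
		ideal = append(ideal, g)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	idcg := 0.0
	for i, g := range ideal {
		if i >= k {
			break
		}
		idcg += float64(g) / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

func main() {
	results := []string{"x", "db-2", "db-1"} // illustrative ranked result IDs
	expected := []string{"db-1", "db-2", "db-3"}
	grades := map[string]int{"db-1": 3, "db-2": 2, "db-3": 1}
	fmt.Printf("P@10=%.2f R@10=%.2f RR=%.2f NDCG@10=%.2f\n",
		precisionAtK(results, expected, 10), recallAtK(results, expected, 10),
		reciprocalRank(results, expected), ndcgAtK(results, grades, 10))
}
```

Across a whole suite, MRR would be the mean of `reciprocalRank` over all queries, and Hit Rate the fraction of queries whose reciprocal rank is non-zero.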
## Command-Line Options

```bash
go run ./cmd/eval [flags]

Flags:
  -url string        NornicDB server URL (default "http://localhost:7474")
  -suite string      Path to test suite JSON file
  -output string     Output format: summary, detailed, json, compact (default "summary")
  -save string       Save results to JSON file
  -threshold string  Override thresholds (format: p10=0.5,mrr=0.5,hit=0.8)
  -create-sample     Create sample test data in the database
```

## Test Suite Format

```json
{
  "name": "my-test-suite",
  "description": "Search quality tests",
  "version": "1.0.0",
  "test_cases": [
    {
      "name": "ML Concept Search",
      "query": "machine learning neural networks",
      "expected": ["node-id-1", "node-id-2"],
      "tags": ["ml", "concepts"]
    },
    {
      "name": "Graded Relevance Test",
      "query": "database architecture",
      "expected": ["db-1", "db-2", "db-3"],
      "relevance_grades": {
        "db-1": 3,
        "db-2": 2,
        "db-3": 1
      },
      "tags": ["database"]
    }
  ]
}
```

### Test Case Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Human-readable test name |
| `query` | string | Search query text |
| `expected` | []string | Node IDs that should be returned |
| `relevance_grades` | map[string]int | Optional graded relevance (0-3) for NDCG |
| `tags` | []string | Optional tags for filtering |

## Default Thresholds

```go
Thresholds{
    Precision10: 0.5, // At least 50% of top-10 relevant
    Recall10:    0.3, // At least 30% of relevant in top-10
    MRR:         0.5, // First relevant in top 2 on average
    NDCG10:      0.5, // Reasonable ranking quality
    HitRate:     0.8, // 80% of queries have at least one hit
}
```

## Output Formats

### Summary (default)

```
╔════════════════════════════════════════════════════════════════╗
║               NornicDB Search Evaluation Results               ║
╚════════════════════════════════════════════════════════════════╝

📊 Suite: my-tests
📅 Time: 2025-12-01T09:10:12-07:00
⏱️ Duration: 125ms
✅ Tests: 5/5 passed (100.0%)

┌─────────────────────────────────────────────────────────────────┐
│ Aggregate Metrics                                               │
├─────────────────────────────────────────────────────────────────┤
│ ✓ MRR        [████████████████████] 1.000 (target: 0.50)
│ ✓ Recall@10  [████████████████████] 1.000 (target: 0.30)
│ ✓ Hit Rate   [████████████████████] 1.000 (target: 0.80)
└─────────────────────────────────────────────────────────────────┘
```

### Detailed

Includes per-test breakdown:

```
✅ Test 1: ML Search
   Query: "machine learning"
   Method: http | Duration: 1.552ms
   P@10: 0.10 | R@10: 1.00 | MRR: 1.00 | NDCG@10: 1.00
   Expected: 1 | Returned: 1 | Hits: 1
```

### Compact

One-line summary for CI logs:

```
[PASS] 5/5 tests | P@10=0.10 R@10=1.00 MRR=1.00 NDCG=1.00 HitRate=1.00 | 8ms
```

### JSON

Full structured output for programmatic processing.

## CI/CD Integration

### GitHub Actions Example

```yaml
- name: Run Search Quality Tests
  run: |
    cd nornicdb
    go run ./cmd/eval \
      -suite tests/search_quality.json \
      -output json \
      -save eval-results.json \
      -threshold="hit=0.9,mrr=0.7"

- name: Upload Results
  uses: actions/upload-artifact@v4
  with:
    name: eval-results
    path: nornicdb/eval-results.json
```

### Exit Codes

- `0`: All tests passed
- `1`: One or more tests failed

## Programmatic Usage

```go
import (
    "context"
    "log"
    "os"

    "github.com/orneryd/nornicdb/pkg/eval"
    "github.com/orneryd/nornicdb/pkg/search"
)

// searchService is an existing search service instance
// (its construction is not shown here).

// Create harness
harness := eval.NewHarness(searchService)

// Add test cases
harness.AddTestCase(eval.TestCase{
    Name:     "ML concepts",
    Query:    "machine learning",
    Expected: []string{"node-1", "node-2"},
})

// Set custom thresholds
harness.SetThresholds(eval.Thresholds{
    MRR:     0.7,
    HitRate: 0.9,
})

// Run evaluation
ctx := context.Background()
result, err := harness.Run(ctx)
if err != nil {
    log.Fatalf("eval run failed: %v", err)
}

// Output results
reporter := eval.NewReporter(os.Stdout)
reporter.PrintSummary(result)
```

## Best Practices

### 1. Use Real Node IDs

Test cases should use actual storage node IDs, not user-defined `id` properties:

```json
// ✅ Good - uses actual storage IDs
"expected": ["n1", "node-abc123"]

// ❌ Bad - uses property values
"expected": ["my-custom-id"]
```

### 2. Use Graded Relevance for NDCG

For meaningful NDCG scores, provide relevance grades:

```json
"relevance_grades": {
  "highly-relevant-doc": 3,
  "relevant-doc": 2,
  "marginal-doc": 1,
  "irrelevant-doc": 0
}
```

### 3. Set Realistic Thresholds

Start with lenient thresholds and tighten as search quality improves:

```bash
# Development
-threshold="hit=0.5,mrr=0.3"

# Production
-threshold="hit=0.9,mrr=0.7,p10=0.5"
```

### 4. Tag Tests for Filtering

Use tags to organize and filter tests:

```json
"tags": ["semantic", "ml", "critical"]
```

## Troubleshooting

### Low Precision but High Recall

Search returns many results but expected docs are scattered.

- **Check**: Does the test case list only a few expected IDs? In the Detailed sample above, one expected ID with one hit yields a P@10 of 0.10, so short `expected` lists produce low precision even when ranking is perfect.
- **Fix**: Improve ranking algorithm or add MMR diversification

### Zero Hit Rate

No expected docs found in any results.

- **Check**: Are expected IDs correct (storage IDs, not properties)?
- **Check**: Has the search index been rebuilt?

```bash
curl -X POST http://localhost:7474/nornicdb/search/rebuild
```

### Slow Evaluation

- Reduce `limit` in search options
- Use fewer test cases for quick iteration
- Run full suite only in CI

## Related Documentation

- [Vector Search Guide](../user-guides/vector-search.md)
- [Hybrid Search Guide](../user-guides/hybrid-search.md)
- [Hybrid Search - RRF Algorithm](../user-guides/hybrid-search.md#rrf-algorithm)

---

_Eval Harness v1.0 - December 2025_
