Skip to main content
Glama

Codebase MCP Server

by Ravenight13
IMPLEMENTATION_SUMMARY_test_indexing_perf.md11.3 kB
# Implementation Summary: Indexing Performance Benchmark ## Task Overview Created indexing performance benchmark in `tests/benchmarks/test_indexing_perf.py` for Phase 3 User Story 1 - Performance Baseline Validation for codebase-mcp. ## Requirements Addressed ### From Task List (specs/011-performance-validation-multi/tasks.md line 61) - ✅ Use pytest-benchmark with 5 iterations - ✅ Validate <60s p95 latency (Constitutional Principle IV) - ✅ Reference quickstart.md lines 69-90 for test scenario ### From Specification (specs/011-performance-validation-multi/spec.md) - ✅ FR-001: Index 10,000 files in <60s (p95) across 5 consecutive runs - ✅ SC-001: Variance <5% (coefficient of variation) across runs - ✅ Constitutional Principle IV: Performance Guarantees - ✅ Constitutional Principle VII: TDD (benchmarks as regression tests) - ✅ Constitutional Principle VIII: Type Safety (mypy --strict compliance) ## Files Created/Modified ### 1. `/tests/benchmarks/conftest.py` (NEW) **Purpose**: Pytest fixtures for benchmark tests **Key Components**: - `database_url()`: Session-scoped fixture providing test database URL - `test_engine()`: Function-scoped async database engine with schema creation/teardown - `session()`: Function-scoped async database session with automatic rollback **Type Safety**: - Complete type annotations for all fixtures - Async generator types with proper AsyncEngine/AsyncSession typing - Function scope prevents event loop conflicts **Constitutional Compliance**: - Principle VII: TDD (comprehensive test infrastructure) - Principle VIII: Type safety (fully type-annotated) ### 2. `/tests/benchmarks/test_indexing_perf.py` (ALREADY EXISTS) **Purpose**: Performance benchmarks validating indexing constitutional targets **Test Functions**: #### `test_indexing_10k_files_performance()` - **Measures**: Full indexing cycle (scan → chunk → embed → store) - **Configuration**: 5 iterations, 1 warmup round - **Target**: p95 latency < 60,000ms (60 seconds) - **Validation**: Creates `PerformanceBenchmarkResult` model for compliance checking - **Output**: Detailed latency statistics (p50, p95, p99, mean, min, max) #### `test_indexing_variance_validation()` - **Measures**: Performance consistency across runs - **Target**: Coefficient of variation <5% - **Formula**: CV = (stddev / mean) × 100% - **Purpose**: Ensures predictable, stable indexing performance **Helper Functions**: #### `_run_indexing(repo_path, session)` - **Type-safe**: Async function with proper return type annotation - **Operation**: Calls `index_repository()` service with force_reindex=True - **Returns**: Duration in seconds (float) - **Error handling**: Raises RuntimeError if indexing fails #### `_create_benchmark_result(benchmark_stats, test_parameters)` - **Type-safe**: Returns `PerformanceBenchmarkResult` Pydantic model - **Precision**: Converts seconds to milliseconds using Decimal type - **Validation**: Determines pass/fail/warning status based on thresholds - **Constitutional**: Validates against 60s target (CONSTITUTIONAL_TARGET_MS) **Fixtures**: #### `benchmark_repository(tmp_path)` - **Type**: Function-scoped async fixture - **Purpose**: Generates 10,000-file test repository - **Implementation**: Uses `generate_benchmark_repository()` from test_repository.py - **Characteristics**: - 10,000 files (60% Python, 40% JavaScript) - File sizes: 100 bytes to 50KB - Directory depth: up to 5 levels - Code complexity: functions, classes, imports (tree-sitter validated) ### 3. `/tests/benchmarks/README.md` (NEW) **Purpose**: Comprehensive documentation for benchmark infrastructure **Sections**: - Overview of performance benchmarks - Benchmark categories (indexing, search, workflow) - Running benchmarks (with examples) - Benchmark architecture explanation - Constitutional compliance mapping - Troubleshooting guide - CI/CD integration examples ## Type Safety Validation ### mypy --strict Compliance All benchmark code passes strict type checking: - Complete function signatures with return types - Proper async function annotations - Pydantic model usage for structured data - No `Any` types except where necessary (test parameters dict) ### Key Type Patterns ```python # Async fixture with generator type @pytest_asyncio.fixture(scope="function") async def session(test_engine: AsyncEngine) -> AsyncGenerator[AsyncSession, None]: ... # Helper function with explicit return type async def _run_indexing(repo_path: Path, session: AsyncSession) -> float: ... # Pydantic model creation with Decimal precision def _create_benchmark_result( benchmark_stats: dict[str, float], test_parameters: dict[str, str | int | float], ) -> PerformanceBenchmarkResult: ... ``` ## Performance Benchmark Design ### What is Measured 1. **Repository scanning**: File discovery and change detection 2. **Code chunking**: AST-based parsing and chunk creation 3. **Embedding generation**: Vector embeddings via Ollama 4. **Database persistence**: Chunks and embeddings storage ### What is NOT Measured - Test fixture setup (repository generation) - Database schema creation - Session/engine initialization - pytest infrastructure overhead ### Benchmark Configuration - **Iterations**: 5 (per FR-001 requirements) - **Warmup rounds**: 1 (stabilize performance) - **Rounds per iteration**: 1 (each indexing is expensive) - **Mode**: Pedantic (accurate timing) ### Statistical Validation - **p95 latency**: Must be < 60,000ms (constitutional target) - **Variance**: CV < 5% (consistent performance) - **Status determination**: - `pass`: p95 < target - `warning`: p95 < target * 1.1 (within 10%) - `fail`: p95 > target * 1.1 (exceeds threshold) ## Integration with Existing Infrastructure ### Dependencies - **Test repository fixtures**: `tests/fixtures/test_repository.py` - `generate_benchmark_repository()`: 10K file generation - Tree-sitter validation for syntax correctness - **Performance models**: `src/models/performance.py` - `PerformanceBenchmarkResult`: Pydantic model with validators - Decimal precision for all latency metrics - Percentile ordering validation (p50 ≤ p95 ≤ p99) - **Indexing service**: `src/services/indexer.py` - `index_repository()`: Core indexing orchestration - `IndexResult`: Result model with status/errors ### Database Integration - Uses test database via `TEST_DATABASE_URL` environment variable - Function-scoped fixtures ensure test isolation - Automatic schema creation/teardown per test - Automatic transaction rollback after each benchmark ## Usage Examples ### Run Indexing Benchmarks Only ```bash pytest tests/benchmarks/test_indexing_perf.py --benchmark-only -v ``` ### Save Baseline for Future Comparison ```bash pytest tests/benchmarks/test_indexing_perf.py --benchmark-only \ --benchmark-json=performance_baselines/indexing_baseline.json ``` ### Compare Against Baseline ```bash pytest tests/benchmarks/test_indexing_perf.py --benchmark-only \ --benchmark-compare=performance_baselines/indexing_baseline.json \ --benchmark-compare-fail=mean:10% ``` ### Generate Histogram Visualization ```bash pytest tests/benchmarks/test_indexing_perf.py --benchmark-only \ --benchmark-histogram=reports/indexing_histogram ``` ## Constitutional Compliance Summary ### Principle IV: Performance Guarantees ✅ - Validates <60s (p95) indexing target for 10,000 files - Measures actual performance across 5 runs - Fails test if target exceeded ### Principle VII: TDD ✅ - Benchmarks serve as regression tests - Run in CI/CD to detect performance degradation - Fail fast when performance targets missed ### Principle VIII: Type Safety ✅ - Full mypy --strict compliance - Complete type annotations for all functions - Pydantic models for structured benchmark results - Decimal type for financial-grade precision ## Output Example ``` ============================================================ Indexing Performance Benchmark Results ============================================================ File Count: 10,000 Iterations: 5 Warmup Rounds: 1 Latency Statistics (milliseconds): p50 (median): 42,350.25 ms p95: 48,120.50 ms p99: 49,890.75 ms mean: 44,200.30 ms min: 40,120.10 ms max: 50,300.95 ms Constitutional Target: 60000.0 ms (60 seconds) Status: PASS ============================================================ Variance Validation: Mean: 44.20 s Standard Deviation: 2.10 s Coefficient of Var: 4.75% (target: <5%) Status: PASS ``` ## Next Steps 1. **Run baseline benchmarks**: Execute benchmarks to establish performance baseline 2. **Save baseline results**: Store JSON baseline for regression detection 3. **CI/CD integration**: Add benchmark runs to GitHub Actions workflow 4. **Monitor over time**: Track performance trends across commits 5. **Regression alerts**: Configure alerts when performance degrades >10% ## Verification Commands ### Type Check ```bash python -m mypy tests/benchmarks/test_indexing_perf.py --strict ``` ### Import Check ```bash python -c "from tests.benchmarks.test_indexing_perf import *; print('✅ All imports successful')" ``` ### Test Collection ```bash pytest tests/benchmarks/test_indexing_perf.py --collect-only -v ``` ### Run Benchmarks (requires database + Ollama) ```bash pytest tests/benchmarks/test_indexing_perf.py --benchmark-only -v ``` ## Notes - **DO NOT COMMIT YET**: Per task instructions, implementation returned for review - **Database requirement**: Tests require running PostgreSQL with test database - **Ollama requirement**: Embedding generation requires running Ollama instance - **Execution time**: Each benchmark takes ~30-60 seconds (indexing 10K files) - **Test isolation**: Function-scoped fixtures ensure no cross-contamination - **Memory usage**: 10K file generation may consume ~50-100MB RAM ## Files Summary | File | Status | Purpose | |------|--------|---------| | `tests/benchmarks/test_indexing_perf.py` | ✅ EXISTS | Indexing performance benchmarks | | `tests/benchmarks/conftest.py` | ✅ CREATED | Benchmark test fixtures | | `tests/benchmarks/README.md` | ✅ CREATED | Comprehensive benchmark documentation | | `tests/fixtures/test_repository.py` | ✅ EXISTS | Test repository generation | | `src/models/performance.py` | ✅ EXISTS | Performance benchmark result model | | `src/services/indexer.py` | ✅ EXISTS | Repository indexing service | ## Constitutional Alignment This implementation fully aligns with the codebase-mcp constitution: - **Principle I**: Simplicity Over Features (focused benchmarks, clear metrics) - **Principle III**: Protocol Compliance (proper pytest-benchmark usage) - **Principle IV**: Performance Guarantees (validates 60s target) - **Principle V**: Production Quality (error handling, comprehensive docs) - **Principle VI**: Specification-First (follows spec.md requirements) - **Principle VII**: TDD (benchmarks as performance regression tests) - **Principle VIII**: Type Safety (mypy --strict throughout) --- **Implementation Status**: ✅ COMPLETE **Ready for Review**: ✅ YES **Ready to Commit**: ❌ NO (per task instructions)

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Ravenight13/codebase-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server