# Test Suite Optimization Roadmap
**Status**: Implementation Plan
**Created**: 2025-11-20
**Current State**: 222 tests in the default run (274 total), 34s runtime, 67% coverage
**Target State**: Optimized suite, 10-12s runtime, 80% coverage
---
## Executive Summary
Following comprehensive analysis by 10 parallel agents, this roadmap systematically addresses all identified issues in the test suite. The plan balances quick wins (50% speedup via parallelization) with strategic improvements (removing 57 redundant tests, improving coverage from 67% to 80%).
**Key Metrics**:
- Current runtime: 34 seconds (serial)
- Projected runtime after Phase 1: 15-17 seconds (50% reduction via parallelization)
- Projected runtime after the remaining phases: 10-12 seconds (65-71% reduction total)
- Test coverage improvement: 67% → 80% (+13 percentage points)
- Test count reduction: 274 total → 217 total (removing 57 redundant)
---
## Implementation Phases
### Phase 1: Parallel Execution Infrastructure ✅ COMPLETE
**Status**: ✅ Implemented 2025-11-20
**Effort**: 30 minutes
**Impact**: 50% speedup (34s → 15-17s)
#### Changes Made:
1. **`pytest.ini` modification** (lines 20-26; `-n` requires the `pytest-xdist` plugin):
```ini
addopts =
    -v
    --tb=short
    --strict-markers
    --color=yes
    -m "not expensive and not integration and not accuracy and not marker"
    # NEW: Enable parallel execution with auto worker detection
    -n auto
```
2. **`tests/conftest.py` worker isolation** (lines 271-306; sketched below):
- Added `worker_id` and `tmp_path_factory` parameters to `cleanup_sessions` fixture
- Each xdist worker gets isolated sessions directory to prevent race conditions
- Master process uses normal `data/sessions` directory
- Workers use `/tmp/pytest-*/gw{N}/sessions` directories
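A minimal sketch of that shape (details assumed; the real fixture in `tests/conftest.py`, lines 271-306, may differ):
```python
# Sketch only: assumed shape of the worker-isolation logic.
import shutil
from pathlib import Path

import pytest

@pytest.fixture(autouse=True)
def cleanup_sessions(worker_id, tmp_path_factory):
    if worker_id == "master":
        # Non-distributed run (or the xdist controller): normal directory.
        sessions_dir = Path("data/sessions")
    else:
        # Each xdist worker (gw0, gw1, ...) gets an isolated directory
        # under /tmp/pytest-*/ so workers never race on shared files.
        sessions_dir = tmp_path_factory.getbasetemp() / "sessions"
    sessions_dir.mkdir(parents=True, exist_ok=True)
    yield sessions_dir
    if worker_id != "master":
        shutil.rmtree(sessions_dir, ignore_errors=True)
```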
#### Verification:
```bash
time python -m pytest -q --tb=no # Should complete in 15-17s with 16 workers
```
#### Success Criteria:
- ✅ Tests run in parallel (16 workers on 16-core machine)
- ✅ No race conditions or failures from shared state
- ✅ 50% reduction in runtime (34s → 15-17s)
---
### Phase 2: Fixture Optimization (Week 1-2)
**Effort**: 4-6 hours
**Impact**: Additional 20-25% speedup, cost reduction
#### 2.1: Convert Expensive Fixtures to Session Scope
**Files**: `tests/conftest.py` (lines 135-186)
**Current State**:
```python
@pytest_asyncio.fixture(scope="function")  # ❌ Runs per test
async def botany_farm_dates(botany_farm_markdown):
    extractor = DateExtractor()
    results = await extractor.extract(...)  # API call
    return results
```
**Target State**:
```python
@pytest_asyncio.fixture(scope="session")  # ✅ Runs once
async def botany_farm_dates(botany_farm_markdown):
    extractor = DateExtractor()
    results = await extractor.extract(...)  # API call ONCE
    return results
```
**Changes Required**:
1. Change scope of 3 fixtures from `function` to `session` (see the event-loop caveat after this list):
- `botany_farm_dates` (line 135)
- `botany_farm_tenure` (line 153)
- `botany_farm_project_ids` (line 171)
2. Verify tests don't mutate fixture data (read-only usage)
3. Add cache clearing after session:
```python
@pytest.fixture(scope="session", autouse=True)
def cleanup_expensive_fixtures():
    yield  # Run all tests
    # Clear any cached data
```
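One caveat for step 1: pytest-asyncio binds async fixtures to an event loop that is function-scoped by default, so a session-scoped async fixture can fail with "attached to a different loop" errors. A sketch of the fix, assuming a recent pytest-asyncio (0.24+; older releases required overriding the `event_loop` fixture instead):
```python
import pytest_asyncio

# Assumption: pytest-asyncio >= 0.24, where loop_scope pins the fixture's
# event loop to the session so the awaited result can be reused safely.
@pytest_asyncio.fixture(scope="session", loop_scope="session")
async def botany_farm_dates(botany_farm_markdown):
    extractor = DateExtractor()
    return await extractor.extract(...)  # API call once per session
```
Step 2 still matters here: a mutated session-scoped result leaks state between tests, so treat it as read-only or hand out copies.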
**Estimated Savings**:
- 3 LLM API calls eliminated (previously called per test)
- $0.02-0.05 cost savings per test run
- 2-3 seconds faster
#### 2.2: Optimize cleanup_sessions Fixture
**File**: `tests/conftest.py` (line 271)
**Current State**:
- `autouse=True` means the fixture runs for ALL 274 tests
- Setup and teardown execute around every test (~548 runs per suite)
- File I/O overhead on each run
**Target State**:
- Use `autouse=False` - only runs when needed
- Add explicit fixture request to tests that actually need cleanup
**Changes Required**:
1. Remove `autouse=True` from fixture decorator (line 271)
2. Add fixture to tests that create sessions:
```python
def test_create_session(cleanup_sessions):  # Explicit request
    ...
```
3. Identify which tests actually need cleanup (estimated 40-50 tests)
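For files where most tests create sessions, a module-level opt-in avoids editing every signature (a sketch using pytest's `usefixtures` marker):
```python
import pytest

# Applies cleanup_sessions to every test in this module without changing
# individual test signatures.
pytestmark = pytest.mark.usefixtures("cleanup_sessions")
```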
**Estimated Savings**:
- Reduce cleanup runs from ~548 to ~100 (over 80% reduction)
- 1-2 seconds faster
**Success Criteria**:
- Tests that don't create sessions run faster
- No test failures from missing cleanup
- `pytest -v` output shows reduced fixture usage
---
### Phase 3: Test Redundancy Elimination (Week 2-3)
**Effort**: 6-8 hours
**Impact**: Cleaner suite, 5-10% faster, easier maintenance
#### 3.1: Consolidate test_upload_tools.py
**File**: `tests/test_upload_tools.py` (1,084 lines, 57 tests)
**Analysis**:
- 21% of the entire test suite sits in one file
- Many tests validate same functionality with minor variations
- Pattern: "test same thing with different input format"
**Refactoring Strategy**:
1. **Identify core scenarios** (10-15 tests):
- Valid upload (PDF, shapefile, geojson)
- Invalid file handling
- Deduplication
- Error cases
2. **Parameterize variations**:
```python
@pytest.mark.parametrize("file_format,expected", [
    ("pdf", "success"),
    ("shapefile", "success"),
    ("geojson", "success"),
    ("invalid", "error"),
])
def test_file_upload(file_format, expected):
    ...  # Single parameterized test handles all formats
```
3. **Remove redundant tests**:
- Tests that validate same behavior with trivial differences
- Tests that overlap with integration tests
- Tests with weak/duplicate assertions
**Estimated Reduction**:
- Remove 20-25 tests from this file
- Reduce from 1,084 lines to ~600 lines
- 3-5 seconds faster
#### 3.2: Remove Duplicate Validation Tests
**Files**:
- `tests/test_validation.py` (19 tests)
- `tests/test_phase4_validation.py` (overlap)
**Strategy**:
1. Map test coverage overlap (see the sketch after this list)
2. Keep higher-quality tests (better assertions, clearer intent)
3. Remove redundant validation checks
4. Consolidate into single validation test file
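To map the overlap (step 1), one option is coverage.py's dynamic contexts: record a run with pytest-cov's `--cov-context=test` flag, then inspect which tests hit each line. A sketch against coverage.py's data API:
```python
import coverage

cov = coverage.Coverage()
cov.load()  # reads .coverage from a pytest --cov --cov-context=test run
data = cov.get_data()
for path in data.measured_files():
    if "validation" not in path:
        continue
    for lineno, contexts in data.contexts_by_lineno(path).items():
        if len(contexts) > 1:  # line exercised by more than one test
            print(f"{path}:{lineno} -> {sorted(contexts)}")
```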
**Estimated Reduction**:
- Remove 10-15 redundant tests
- 1-2 seconds faster
#### 3.3: Consolidate LLM Testing
**Files**:
- `tests/test_llm_extraction_integration.py`
- `tests/test_llm_json_validation.py`
**Strategy**:
1. Identify mock-only tests (can be faster unit tests)
2. Identify integration tests (need real workflow)
3. Remove tests that just validate mock behavior (example below)
4. Keep tests that validate actual extraction logic
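As an illustration of step 3, a hypothetical test of this shape only confirms that the mock returns what it was configured to return, and is a removal candidate:
```python
from unittest.mock import AsyncMock

import pytest

@pytest.mark.asyncio
async def test_extractor_returns_dates():  # ❌ exercises the mock, not our code
    llm = AsyncMock(return_value={"dates": ["2025-01-01"]})
    result = await llm()
    assert result == {"dates": ["2025-01-01"]}  # restates the mock's config
```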
**Estimated Reduction**:
- Remove 5-10 tests
- Convert some to unit tests (faster)
**Success Criteria (Phase 3)**:
- Test count reduced from 274 to ~220
- No reduction in actual coverage
- Clearer test organization
- 5-10% faster runtime
---
### Phase 4: Coverage Improvement (Week 3-4)
**Effort**: 8-12 hours
**Impact**: 67% → 80% coverage
#### 4.1: Critical Missing Coverage
**Priority 1: Unified Analysis (0% coverage)**
**File**: `src/registry_review_mcp/prompts/unified_analysis.py`
**Tests to Add**:
1. `test_unified_analysis_schema_validation()` - Ensure output matches expected structure
2. `test_unified_analysis_with_mock_llm()` - Mock LLM responses
3. `test_unified_analysis_error_handling()` - Invalid input handling
**New test code**: ~50-75 lines (a skeleton for test 1 follows)
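```python
# Hypothetical: unified_analysis_result and EXPECTED_KEYS stand in for the
# real fixture and output schema of prompts/unified_analysis.py.
EXPECTED_KEYS = {"summary", "findings", "confidence"}

def test_unified_analysis_schema_validation(unified_analysis_result):
    assert EXPECTED_KEYS <= set(unified_analysis_result)
    assert isinstance(unified_analysis_result["findings"], list)
```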
**Priority 2: Validation Framework (48% coverage)**
**File**: `src/registry_review_mcp/models/validation.py`
**Gaps**:
- Lines 45-68: `ValidationResult` class methods
- Lines 89-112: `ValidationError` handling
- Lines 134-156: Edge case validation
**Tests to Add**:
1. `test_validation_result_aggregation()` - Test result combining
2. `test_validation_error_recovery()` - Error handling paths
3. `test_validation_edge_cases()` - Boundary conditions
**New test code**: ~75-100 lines
**Priority 3: Cost Tracking (34% coverage)**
**File**: `src/registry_review_mcp/utils/cost_tracking.py`
**Gaps**:
- Lines 78-95: Token calculation methods
- Lines 112-134: Cost aggregation
- Lines 156-178: Report generation
**Tests to Add**:
1. `test_token_calculation_accuracy()` - Verify token counting
2. `test_cost_aggregation_multi_session()` - Multiple sessions
3. `test_cost_report_generation()` - Report formatting
**New test code**: ~60-80 lines
#### 4.2: Coverage Tracking
**Add coverage configuration** to `setup.cfg` or `tox.ini` (coverage.py does not read `pytest.ini`):
```ini
[coverage:run]
source = src/registry_review_mcp
# Track branch coverage too
branch = True

[coverage:report]
precision = 2
show_missing = True
# Fail if coverage drops below 75%
fail_under = 75
```
**Add coverage enforcement**:
```bash
pytest --cov=src/registry_review_mcp --cov-report=html --cov-fail-under=75
```
**Success Criteria**:
- Overall coverage: 67% → 80%
- No critical modules below 50%
- Branch coverage tracking enabled
- CI/CD integration for coverage checks
---
### Phase 5: Anti-Pattern Fixes (Week 4-5)
**Effort**: 3-4 hours
**Impact**: Better error reporting, maintainability
#### 5.1: Replace Bare Except Blocks
**Pattern Found** (7 occurrences):
```python
try:
    risky_operation()
except:  # ❌ Catches everything, including KeyboardInterrupt
    pass
```
**Fix**:
```python
try:
    risky_operation()
except (SpecificError1, SpecificError2) as e:  # ✅ Explicit
    logger.warning(f"Expected error: {e}")
```
**Files to Update**:
1. `tests/test_document_processing.py` (2 occurrences)
2. `tests/test_integration_full_workflow.py` (3 occurrences)
3. `tests/test_upload_tools.py` (2 occurrences)
#### 5.2: Strengthen Weak Assertions
**Pattern Found** (12 occurrences):
```python
assert result is not None # ❌ Weak - just checks existence
```
**Fix**:
```python
assert len(result) == 5 # ✅ Specific expectation
assert result["status"] == "completed"
assert all(item.confidence > 0.5 for item in result)
```
**Strategy**:
1. Grep for `assert .* is not None` patterns
2. Replace with specific value checks
3. Add domain-specific validations
#### 5.3: Fix Unnecessary Async Tests
**Pattern Found** (13 tests):
```python
@pytest.mark.asyncio
async def test_sync_function():  # ❌ Not actually async
    result = sync_function()
    assert result
```
**Fix**:
```python
def test_sync_function():  # ✅ Remove async
    result = sync_function()
    assert result
```
**Benefit**: Slight performance improvement (no async overhead)
**Success Criteria**:
- Zero bare `except:` blocks
- All assertions check specific values
- Async only where needed
---
### Phase 6: Test Organization (Week 5)
**Effort**: 2-3 hours
**Impact**: Better discoverability, maintainability
#### 6.1: Split Large Test Files
**Current Structure**:
```
tests/
    test_upload_tools.py               (1,084 lines)  # Too large
    test_integration_full_workflow.py  (450 lines)
    test_document_processing.py        (380 lines)
```
**Target Structure**:
```
tests/
    upload/
        test_upload_pdf.py
        test_upload_gis.py
        test_upload_validation.py
        test_upload_deduplication.py
    integration/
        test_end_to_end_workflow.py
        test_multi_document_workflow.py
    document/
        test_discovery.py
        test_classification.py
        test_metadata_extraction.py
```
#### 6.2: Add Test Categories
**Mark tests by purpose**:
```python
@pytest.mark.smoke # <1s, critical path
@pytest.mark.unit # Fast, isolated
@pytest.mark.integration # Full workflow
@pytest.mark.expensive # Real API calls
```
Because `--strict-markers` is set in `addopts`, register each new marker under `markers =` in `pytest.ini` before first use.
**Usage**:
```bash
pytest -m smoke # Quick sanity check (30 tests, <5s)
pytest -m unit # Fast unit tests (120 tests, <10s)
pytest -m "not expensive" # Default run
```
#### 6.3: Document Test Patterns
**Create**: `tests/README.md`
Content:
- How to run different test suites
- Adding new tests (which file, which markers)
- Fixture usage guide
- Common patterns and anti-patterns
**Success Criteria**:
- No test file >500 lines
- Clear category structure
- Easy to find related tests
- Documented patterns
---
### Phase 7: Performance Fine-Tuning (Week 6)
**Effort**: 4-6 hours
**Impact**: Final 10-20% speedup
#### 7.1: Async Optimization
**Fix Blocking I/O in Async Tests** (11 occurrences):
**Pattern**:
```python
async def test_something():
    with open("file.txt") as f:  # ❌ Blocking I/O in async
        content = f.read()
```
**Fix**:
```python
import aiofiles

async def test_something():
    async with aiofiles.open("file.txt") as f:  # ✅ Non-blocking
        content = await f.read()
```
**Files to Update**:
- `tests/test_document_processing.py` (5 occurrences)
- `tests/test_evidence_extraction.py` (4 occurrences)
- `tests/test_upload_tools.py` (2 occurrences)
#### 7.2: Optimize Test Data Loading
**Current Pattern** (repeated loads):
```python
def test_1():
    data = json.load(open("large_file.json"))  # Load 1

def test_2():
    data = json.load(open("large_file.json"))  # Load 2 (duplicate)
```
**Optimized Pattern** (session fixture):
```python
@pytest.fixture(scope="session")
def large_test_data():
    with open("large_file.json") as f:
        return json.load(f)  # Load once per session

def test_1(large_test_data):
    ...  # Use cached data

def test_2(large_test_data):
    ...  # Use cached data
```
#### 7.3: Database Query Optimization
**If using real database in tests**:
- Use in-memory SQLite for unit tests (sketch below)
- Batch database setup/teardown
- Minimize transaction overhead
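A minimal sketch of the in-memory option, assuming the stdlib `sqlite3` API is compatible with the code under test:
```python
import sqlite3

import pytest

@pytest.fixture()
def db_conn():
    conn = sqlite3.connect(":memory:")  # lives in RAM; no file I/O
    yield conn
    conn.close()  # the database vanishes with the connection
```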
**Success Criteria**:
- No blocking I/O in async contexts
- Minimal redundant file I/O
- Final runtime: 10-12 seconds
---
## Alternative Strategy: Tiered Testing (Optional)
If pure optimization doesn't meet <10s target, implement tiered approach:
### Tier 1: Smoke Tests (<5 seconds)
**Purpose**: Instant feedback on critical paths
**Count**: 20-30 tests
**Coverage**: Core workflows only
**Usage**: Pre-commit hook, quick validation
```bash
pytest -m smoke --tb=line -q
```
### Tier 2: Unit Tests (<15 seconds)
**Purpose**: Fast, comprehensive validation
**Count**: 120-150 tests
**Coverage**: Individual components, mocked dependencies
**Usage**: Default developer workflow
```bash
pytest -m unit
```
### Tier 3: Integration Tests (<60 seconds)
**Purpose**: Full system validation
**Count**: 40-50 tests
**Coverage**: End-to-end workflows
**Usage**: Pre-push, CI/CD
```bash
pytest -m integration
```
### Tier 4: Expensive Tests (as needed)
**Purpose**: Accuracy validation with real APIs
**Count**: 30-40 tests
**Coverage**: Ground truth validation
**Usage**: Weekly, before releases
```bash
pytest -m expensive
```
---
## Implementation Timeline
| Week | Phase | Effort | Cumulative Time |
|------|-------|--------|-----------------|
| 1 | Phase 1: Parallel Execution ✅ | 0.5h | ✅ Complete |
| 1-2 | Phase 2: Fixture Optimization | 4-6h | 4.5-6.5h |
| 2-3 | Phase 3: Redundancy Elimination | 6-8h | 10.5-14.5h |
| 3-4 | Phase 4: Coverage Improvement | 8-12h | 18.5-26.5h |
| 4-5 | Phase 5: Anti-Pattern Fixes | 3-4h | 21.5-30.5h |
| 5 | Phase 6: Test Organization | 2-3h | 23.5-33.5h |
| 6 | Phase 7: Performance Tuning | 4-6h | 27.5-39.5h |
**Total Estimated Effort**: 27.5-39.5 hours (3.5-5 days)
---
## Success Metrics
### Performance Targets
- ✅ **Phase 1**: 34s → 15-17s (50% reduction) - **COMPLETE**
- **Phase 2**: 15-17s → 12-14s (20% additional reduction)
- **Phase 3**: 12-14s → 11-13s (8% additional reduction)
- **Final Target**: 10-12s total runtime (65-71% reduction from the 34s baseline)
### Quality Targets
- **Coverage**: 67% → 80% (+13 percentage points)
- **Test Count**: 274 → ~217 (57 redundant tests removed)
- **Maintainability**: No file >500 lines, clear organization
- **CI/CD Ready**: Fast enough for pre-commit hooks (<5s smoke tests)
### Cost Targets
- **LLM API Costs**: Reduce by 60-70% via fixture scoping
- **Developer Time**: 34s → 12s = 22s saved per test run
  - 50 test runs/day → 18 minutes saved/day
  - 5 days/week → 90 minutes saved/week
---
## Rollback Plan
If any phase causes test failures:
1. **Immediate Rollback**:
```bash
git revert HEAD # Undo last commit
pytest -q # Verify tests pass
```
2. **Investigate**:
- Check git diff to see what changed
- Run failing tests in isolation
- Check for race conditions (parallel execution)
3. **Incremental Fix**:
- Fix issue in separate branch
- Validate with full suite
- Merge when stable
---
## Monitoring & Validation
### After Each Phase
Run validation suite:
```bash
# Full test run with timing
time pytest -v --tb=short
# Coverage check
pytest --cov=src/registry_review_mcp --cov-report=term-missing
# Verify no regressions
pytest -m "expensive or integration or accuracy" --tb=short
```
### Success Indicators
- ✅ All tests pass
- ✅ Runtime decreases or stays the same
- ✅ Coverage increases or stays the same
- ✅ No new warnings
- ✅ Parallel execution works correctly
### Failure Indicators
- ❌ New test failures
- ❌ Increased runtime
- ❌ Decreased coverage
- ❌ Race conditions in parallel mode
---
## Next Actions
### Immediate (This Week)
1. ✅ **Phase 1 Complete**: Parallel execution enabled
2. Verify parallel execution stability with full suite
3. Measure baseline metrics post-parallelization
4. Begin Phase 2: Fixture scope optimization
### Short-term (Next 2 Weeks)
1. Complete Phase 2 (fixture optimization)
2. Complete Phase 3 (redundancy elimination)
3. Start Phase 4 (coverage improvement)
### Medium-term (Next 4-6 Weeks)
1. Complete all phases 4-7
2. Document test patterns
3. Set up coverage enforcement in CI/CD
4. Validate against all success criteria
---
## Related Documents
- [UX Analysis](./UX_ANALYSIS.md) - User experience improvements
- [Reduction Architecture](./REDUCTION_ARCHITECTURE_DIAGRAM.md) - System simplification
- [Product Advisory Transcript](./transcripts/2025-11-19-product-advisory.md) - Strategic context
---
**Document Owner**: Test Infrastructure Team
**Last Updated**: 2025-11-20
**Next Review**: After Phase 2 completion