# Sprint 2 Lite: Simple Readability Fix
**Sprint Duration**: 2-3 hours
**Complexity**: Easy
**Impact**: Low but necessary cleanup
## Executive Summary
Fix the broken readability scoring with a simple, fast Coleman-Liau formula that requires no syllable counting. Current scores (30-42) will be calibrated to technical document ranges (40-60). This is a quick win before moving to higher-value work.
## Current State Analysis
### What's Broken
```bash
# Current readability scores from our test:
# all models are showing 30-42 for technical documents,
# but they should be 40-60 according to requirements.
sqlite3 /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
"SELECT AVG(readability_score), MIN(readability_score), MAX(readability_score)
FROM chunks WHERE readability_score IS NOT NULL"
```
### The Problem
- Current implementation likely uses broken syllable counting (a typical failure mode is sketched below)
- Scores are too low for technical content (30s instead of 40-60)
- All models show identical scores (suggests calculation happens once, pre-embedding)
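The current code isn't reproduced here, but a regex vowel-group counter (the usual quick implementation) shows why syllable-based formulas drift on real text. A hypothetical sketch, not the project's actual code:
```typescript
// Hypothetical sketch of a naive syllable counter (not the project's code).
// Each run of vowels is treated as one syllable.
function naiveSyllables(word: string): number {
  const vowelGroups = word.toLowerCase().match(/[aeiouy]+/g) || [];
  return Math.max(1, vowelGroups.length);
}

// Silent "e" inflates counts; vowel clusters deflate them:
console.log(naiveSyllables('code'));    // 2, spoken: 1
console.log(naiveSyllables('created')); // 2, spoken: 3
console.log(naiveSyllables('area'));    // 2, spoken: 3
```
Formulas that divide syllable counts by word counts turn these miscounts directly into score error, which would explain the uniformly depressed 30-42 readings.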
## Implementation Plan: Coleman-Liau Formula
### Why Coleman-Liau?
1. **No syllable counting needed** - uses character counts instead
2. **Fast calculation** - simple arithmetic operations
3. **Good for technical content** - handles jargon well
4. **Battle-tested** - used in many readability tools
### The Formula
```text
Coleman-Liau Index:

    CLI = 0.0588 * L - 0.296 * S - 15.8

where:
    L = average number of letters per 100 words
    S = average number of sentences per 100 words
```
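As a worked example: a 100-word passage with 480 letters and 5 sentences gives L = 480 and S = 5, so CLI = 0.0588 * 480 - 0.296 * 5 - 15.8 = 28.224 - 1.48 - 15.8 ≈ 10.9, roughly an 11th-grade reading level before calibration.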
## TMOAT Implementation Steps
### Step 1: Understand Current Implementation
```bash
# Find where readability is calculated
grep -r "readabilityScore\|readability_score" src/ --include="*.ts"
# Expected location: src/domain/content/processing.ts or similar
```
### Step 2: Create Simple Readability Service
```typescript
// src/domain/semantic/algorithms/readability-calculator.ts
export class ReadabilityCalculator {
/**
* Calculate readability using Coleman-Liau Index
* No syllable counting required - uses character counts
* Calibrated for technical documentation (40-60 range)
*/
calculate(text: string): number {
// Count basic metrics
const sentences = this.countSentences(text);
const words = this.countWords(text);
const letters = this.countLetters(text);
// Avoid division by zero
if (words === 0 || sentences === 0) {
return 50; // Default middle score for edge cases
}
// Coleman-Liau formula components
const L = (letters / words) * 100; // Letters per 100 words
const S = (sentences / words) * 100; // Sentences per 100 words
// Calculate raw Coleman-Liau score
const rawScore = 0.0588 * L - 0.296 * S - 15.8;
    // Calibrate for technical documents (40-60 range).
    // Raw CLI grade levels typically land around 5-15; map 5..15 onto 40..60
    const calibrated = 40 + (rawScore - 5) * 2;
// Clamp to valid range
return Math.max(40, Math.min(60, calibrated));
}
  private countSentences(text: string): number {
    // Count sentence endings (.!?) followed by whitespace or end of text,
    // so the final sentence is not missed
    const matches = text.match(/[.!?]+(\s|$)/g) || [];
    return Math.max(1, matches.length);
  }
private countWords(text: string): number {
const words = text.match(/\b\w+\b/g) || [];
return words.length;
}
private countLetters(text: string): number {
const letters = text.match(/[a-zA-Z]/g) || [];
return letters.length;
}
}
```
### Step 3: Test Calculation Standalone
```typescript
// tmp/test-readability.ts
import { ReadabilityCalculator } from '../src/domain/semantic/algorithms/readability-calculator';
const testTexts = {
simple: "This is a simple sentence. It has short words. Easy to read.",
technical: "The semantic extraction service implements KeyBERT algorithms for multiword keyphrase extraction utilizing transformer-based embeddings.",
complex: "Pursuant to the aforementioned architectural considerations, the implementation necessitates comprehensive refactoring of the ContentProcessingService to accommodate the multifaceted requirements of semantic extraction methodologies."
};
const calculator = new ReadabilityCalculator();
Object.entries(testTexts).forEach(([type, text]) => {
const score = calculator.calculate(text);
console.log(`${type}: ${score.toFixed(1)}`);
});
// Expected: all three scores fall in the 40-60 range and rank
// simple < technical <= complex (very short inputs may hit the clamps)
```
### Step 4: Integration Points
Find and update where readability is calculated:
```bash
# Current calculation location
grep -r "readabilityScore" src/ --include="*.ts" -A 5 -B 5
# Likely in SemanticExtractionService or ContentProcessingService
```
Update the integration:
```typescript
// In semantic extraction service
import { ReadabilityCalculator } from './algorithms/readability-calculator';
export class SemanticExtractionService {
private readabilityCalculator = new ReadabilityCalculator();
async extractSemanticData(text: string, embeddings?: Float32Array) {
const keyPhrases = await this.extractKeyPhrases(text, embeddings);
const topics = await this.extractTopics(text, embeddings);
const readabilityScore = this.readabilityCalculator.calculate(text);
return {
keyPhrases,
topics,
readabilityScore
};
}
}
```
### Step 5: TMOAT Validation
#### Test 1: Unit Test the Calculator
```bash
# Build and run test
npm run build
node tmp/test-readability.js
# Verify scores are in 40-60 range
```
#### Test 2: Database Verification After Re-indexing
```bash
# Clear and re-index with new readability
rm -rf /Users/hanan/Projects/folder-mcp/.folder-mcp
npm run daemon:restart &
# Wait for indexing to complete (monitor logs)
sleep 30
# Check new scores
sqlite3 /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
"SELECT
AVG(readability_score) as avg,
MIN(readability_score) as min,
MAX(readability_score) as max,
COUNT(*) as total
FROM chunks
WHERE readability_score IS NOT NULL"
# Expected: avg ~48-52, min ~40, max ~60
```
#### Test 3: Verify Score Distribution
```bash
# Check score distribution makes sense
sqlite3 /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
"SELECT
CASE
WHEN readability_score < 45 THEN '40-45 (Simple)'
WHEN readability_score < 50 THEN '45-50 (Medium)'
WHEN readability_score < 55 THEN '50-55 (Complex)'
ELSE '55-60 (Very Complex)'
END as complexity,
COUNT(*) as count
FROM chunks
WHERE readability_score IS NOT NULL
GROUP BY complexity"
# Should see reasonable distribution across ranges
```
#### Test 4: Compare Known Documents
```bash
# Test specific documents we know
sqlite3 /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
"SELECT
d.file_path,
AVG(c.readability_score) as avg_readability,
COUNT(c.id) as chunks
FROM documents d
JOIN chunks c ON d.id = c.document_id
WHERE d.file_path LIKE '%README.md%'
OR d.file_path LIKE '%epic.md%'
GROUP BY d.file_path"
# README.md should score ~45-48 (accessible)
# epic.md should score ~50-55 (technical planning)
```
#### Test 5: MCP Endpoint Verification
```bash
# Test that MCP returns readability in metadata
folder-mcp mcp server << 'EOF'
{
"jsonrpc": "2.0",
"method": "search",
"params": {
"query": "semantic extraction",
"folder_path": "/Users/hanan/Projects/folder-mcp"
},
"id": 1
}
EOF
# Check that results include readability_score in metadata
```
## Success Criteria
1. **Score Range**: All documents score between 40-60
2. **Distribution**: README files score 40-48, technical docs 48-55, complex docs 55-60
3. **Performance**: <5ms calculation time per chunk (a quick benchmark sketch follows this list)
4. **Consistency**: Similar content gets similar scores across models
5. **MCP Integration**: Readability scores appear in search metadata
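Criterion 3 can be spot-checked in isolation. A minimal micro-benchmark sketch, where the file path, input size, and iteration count are assumptions rather than part of the plan:
```typescript
// tmp/bench-readability.ts (hypothetical file, mirroring the test script above)
import { ReadabilityCalculator } from '../src/domain/semantic/algorithms/readability-calculator';

const calculator = new ReadabilityCalculator();
// Roughly chunk-sized input: ~1,000 words of repeated technical prose
const chunk = 'The semantic extraction service implements keyphrase extraction. '.repeat(125);

const runs = 1_000;
const start = performance.now();
for (let i = 0; i < runs; i++) {
  calculator.calculate(chunk);
}
const msPerChunk = (performance.now() - start) / runs;
console.log(`avg ${msPerChunk.toFixed(3)} ms per chunk (target: < 5 ms)`);
```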
## Rollback Plan
If issues arise:
1. Readability is metadata only - doesn't affect search functionality
2. Can set all scores to a default of 50 as an emergency fix (sketched below)
3. Previous scores can be restored from backup
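For items 2 and 3, the `chunks.readability_score` column queried throughout this plan makes both steps one-liners. A hedged sketch (the backup file name is an assumption):
```bash
# Rollback item 3: keep a restorable copy before changing any scores
cp /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
   /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db.bak

# Rollback item 2: force every existing score to the neutral default of 50
sqlite3 /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
  "UPDATE chunks SET readability_score = 50
   WHERE readability_score IS NOT NULL"
```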
## Time Estimate
- **Implementation**: 30 minutes
- **Testing**: 30 minutes
- **Integration**: 30 minutes
- **Validation**: 30 minutes
- **Total**: ~2 hours
## Next Steps After Completion
Once readability is fixed, move to **Sprint 3: BERTopic** for high-impact topic extraction improvements.
---
## Quick Reference Commands
```bash
# Full test cycle
rm -rf /Users/hanan/Projects/folder-mcp/.folder-mcp
npm run build
npm run daemon:restart &
sleep 30
sqlite3 /Users/hanan/Projects/folder-mcp/.folder-mcp/embeddings.db \
"SELECT AVG(readability_score), MIN(readability_score), MAX(readability_score)
FROM chunks WHERE readability_score IS NOT NULL"
```
This is a simple, testable approach that can be completed in 2-3 hours with clear validation steps.