# Sprint 13: Keywords Quality Enhancement
## Executive Summary
Current keyword extraction produces fragmented, low-quality keywords due to format-agnostic processing. This sprint enhances extraction by leveraging document structure and metadata that we already parse but discard.
**Expected Impact:** 50-90% improvement in keyword quality across all file types.
## Problem Analysis
### Current Issues Identified
**From ONNX Model Database:**
```
- "**Script Flow" (partial markdown)
- "seconds) **Key Points" (broken phrase with punctuation)
- "(10 seconds) **Key" (incomplete fragment)
- "--- ##" (pure formatting)
```
**From Python GPU Model Database:**
```
- "script flow lecture" (lowercase, generic)
- "reasoning forgetful llm" (word salad)
- "seconds key points" (timing mixed with headers)
```
### Root Causes
1. **Text Preprocessing Gaps**
- Markdown formatting treated as content (see the sketch after this list)
- No sentence boundary detection
- Punctuation fragments included
2. **Format-Agnostic Processing**
- PDF metadata keywords ignored
- Document headers treated same as body text
- Excel sheet names and headers discarded
3. **Cosine Similarity Bias**
- Favors frequently occurring phrases
- No structural importance weighting
- Generic terms score higher than specific concepts
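The first two causes compound: because the extractor sees raw markup, n-gram windows straddle formatting tokens and produce exactly the fragments listed above. A hypothetical illustration (not the actual extractor code):
```typescript
// Sliding bigram windows over unstripped markdown mix syntax with content.
const raw = '(10 seconds) **Key Points** --- ## Script Flow';
const tokens = raw.split(/\s+/);
const bigrams = tokens.slice(0, -1).map((t, i) => `${t} ${tokens[i + 1]}`);
console.log(bigrams);
// [ '(10 seconds)', 'seconds) **Key', '**Key Points**', 'Points** ---',
//   '--- ##', '## Script', 'Script Flow' ]
```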
## Architecture Analysis
### Current Pipeline Flow
```
File Parser    →    Chunking    →    Semantic Extraction    →    Document Aggregation
     ↓                  ↓                    ↓                            ↓
Extracts rich      Plain text          Generic n-gram               Simple cosine
metadata but       only                extraction                   similarity
discards it
```
### Integration Points
1. **Parser Stage** (`src/domain/files/parser.ts`)
- Already extracts PDF metadata keywords
- Already parses Word HTML structure
- Already extracts Excel sheet names/headers
- Already processes PowerPoint slide titles
2. **Extraction Stage** (`src/domain/semantic/extraction-service.ts`)
- Orchestrates keyword extraction
- Calls N-gram extractor for ONNX models
- Calls KeyBERT for GPU models
3. **Document Stage** (`src/domain/semantic/document-keyword-scorer.ts`)
- Aggregates chunk keywords
- Applies cosine similarity scoring (sketched after this list)
- Returns top 30 keywords
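For context, the document stage's similarity scoring reduces to plain cosine similarity between candidate and document embeddings, roughly as follows (an assumption about the scorer's core computation, not a verbatim excerpt):
```typescript
// Assumed core of the current scoring: cosine similarity between a candidate
// phrase embedding and the whole-document embedding.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA * normB) || 1); // guard against zero vectors
}
```
Because this score carries no notion of where a phrase came from, frequent body-text phrases outrank headers and metadata, which is the bias Phase 4 corrects.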
## Solution Design
### Phase 1: Enhanced ParsedContent Structure
**Goal:** Capture structural keyword candidates during parsing
**Changes to `src/types/index.ts`:**
```typescript
interface ParsedContent {
  content: string; // Full text (existing)

  // NEW: Format-specific keyword candidates
  structuredCandidates?: {
    metadata?: string[];   // PDF/Word keywords from metadata
    headers?: string[];    // Document headers (H1-H6, #, ##, ###)
    entities?: string[];   // Named entities (sheets, slides, tables)
    emphasized?: string[]; // Bold/italic text
    captions?: string[];   // Figure/table captions
  };

  // NEW: Content zones with importance weights
  contentZones?: Array<{
    text: string;
    type: 'title' | 'header1' | 'header2' | 'header3' | 'body' | 'caption' | 'footer';
    weight: number; // 0-1 importance score
  }>;
}
```
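As a concrete (hypothetical) example, the markdown lecture script from the problem analysis would come out of the parser looking roughly like this:
```typescript
// Hypothetical ParsedContent for a markdown lecture script.
const parsed: ParsedContent = {
  content: 'Script Flow\nKey Points\nReasoning about forgetful LLMs...',
  structuredCandidates: {
    headers: ['Script Flow', 'Key Points'],
    emphasized: ['forgetful LLMs'],
  },
  contentZones: [
    { text: 'Script Flow', type: 'header1', weight: 0.9 },
    { text: 'Reasoning about forgetful LLMs...', type: 'body', weight: 0.4 },
  ],
};
```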
### Phase 2: Format-Specific Parser Enhancements
#### 2.1 Markdown Files Enhancement (`parseTextFile`)
**Extract Headers:**
```typescript
private extractMarkdownHeaders(content: string): string[] {
  const headers: string[] = [];
  const headerRegex = /^(#{1,6})\s+(.+)$/gm;
  let match;
  while ((match = headerRegex.exec(content)) !== null) {
    headers.push(match[2].trim());
  }
  return headers;
}

private cleanMarkdownContent(content: string): string {
  return content
    .replace(/^#{1,6}\s+/gm, '')       // Remove header markers
    .replace(/\*\*([^*]+)\*\*/g, '$1') // Remove bold formatting
    .replace(/\*([^*]+)\*/g, '$1')     // Remove italic formatting
    .replace(/```[\s\S]*?```/g, '')    // Remove code blocks (lazy match handles inner backticks)
    .replace(/^---+$/gm, '')           // Remove horizontal rules
    .replace(/^\|.+\|$/gm, '');        // Remove table rows
}
```
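Called from inside the parser against the kind of input that produced the broken keywords above, the cleaner leaves plain phrases (hypothetical before/after):
```typescript
const raw = '## Script Flow\n(10 seconds) **Key Points**\n---';
this.cleanMarkdownContent(raw);
// → 'Script Flow\n(10 seconds) Key Points\n'
```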
#### 2.2 PDF Files Enhancement (`parsePdfFile`)
**Use Existing Metadata:**
```typescript
// In PDF metadata extraction (already exists):
const structuredCandidates = {
  metadata: pdfData.Meta?.Keywords
    ? pdfData.Meta.Keywords.split(/[,;]/).map(k => k.trim()).filter(Boolean)
    : [],
  headers: extractPDFHeaders(pageTexts), // New function (sketched below)
};
```
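`extractPDFHeaders` does not exist yet. A minimal sketch, assuming headers can be approximated as short, title-cased lines without terminal punctuation (a font-size heuristic would be more reliable if the PDF parser exposes one):
```typescript
// Heuristic header detection: short, title-cased lines that end cleanly.
function extractPDFHeaders(pageTexts: string[]): string[] {
  const headers: string[] = [];
  for (const page of pageTexts) {
    for (const line of page.split('\n')) {
      const text = line.trim();
      const isShort = text.length > 0 && text.length <= 60 && text.split(/\s+/).length <= 8;
      const looksLikeTitle = /^[A-Z]/.test(text) && !/[.:;,]$/.test(text);
      if (isShort && looksLikeTitle) headers.push(text);
    }
  }
  return headers;
}
```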
#### 2.3 Word Documents Enhancement (`parseWordFile`)
**Parse HTML Structure:**
```typescript
private extractWordHeaders(htmlContent: string): string[] {
  const headers: string[] = [];
  // Backreference \1 ensures the closing tag level matches the opening tag.
  const headerRegex = /<h([1-6])[^>]*>([^<]+)<\/h\1>/gi;
  let match;
  while ((match = headerRegex.exec(htmlContent)) !== null) {
    headers.push(match[2].trim());
  }
  return headers;
}

// Use existing metadata keywords:
const metadataKeywords = result.messages
  .filter(m => m.type === 'info' && m.message.includes('keywords'))
  .map(m => m.message.split('keywords:')[1]?.trim())
  .filter(Boolean);
```
#### 2.4 Excel Files Enhancement (`parseExcelFile`)
**Already Extracted - Just Structure It:**
```typescript
const structuredCandidates = {
  entities: workbook.SheetNames,                   // Sheet names as keywords
  headers: extractColumnHeaders(worksheets),       // First-row headers
  emphasized: extractFormulaReferences(worksheets) // Cell references
};
```
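`extractColumnHeaders` and `extractFormulaReferences` are assumed helpers. With SheetJS, column headers can be read from the first row of each sheet, roughly:
```typescript
import * as XLSX from 'xlsx';

// Assumed helper: first-row cells of each worksheet become header keywords.
function extractColumnHeaders(worksheets: Record<string, XLSX.WorkSheet>): string[] {
  const headers: string[] = [];
  for (const sheet of Object.values(worksheets)) {
    // header: 1 yields an array of row arrays; row 0 holds the column headers.
    const rows = XLSX.utils.sheet_to_json<unknown[]>(sheet, { header: 1 });
    const firstRow = rows[0] ?? [];
    headers.push(...firstRow.map(cell => String(cell ?? '').trim()).filter(Boolean));
  }
  return headers;
}
```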
#### 2.5 PowerPoint Enhancement (`parsePowerPointFile`)
**Extract Slide Structure:**
```typescript
const structuredCandidates = {
  headers: slides.map(slide => slide.title).filter(Boolean),
  entities: extractBulletPoints(slides),
  emphasized: extractSpeakerNotes(slides)
};
```
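`extractBulletPoints` and `extractSpeakerNotes` are likewise assumed helpers. A sketch, assuming the existing parser yields a `slides` array with `title`, `bullets`, and `notes` fields:
```typescript
// Assumed slide shape from the existing PowerPoint parser.
interface Slide {
  title?: string;
  bullets?: string[];
  notes?: string;
}

function extractBulletPoints(slides: Slide[]): string[] {
  // Keep only short bullets; long ones are sentences, not keyword candidates.
  return slides
    .flatMap(slide => slide.bullets ?? [])
    .map(b => b.trim())
    .filter(b => b.length > 0 && b.split(/\s+/).length <= 6);
}

function extractSpeakerNotes(slides: Slide[]): string[] {
  return slides.map(slide => slide.notes?.trim() ?? '').filter(Boolean);
}
```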
### Phase 3: Enhanced Semantic Extraction
#### 3.1 Text Preprocessing Service
**New: `src/domain/semantic/text-preprocessor.ts`:**
```typescript
export class TextPreprocessor {
  constructor(private fileType: string) {}

  preprocess(content: string): string {
    switch (this.fileType) {
      case '.md':
        return this.cleanMarkdown(content);
      case '.pdf':
        return this.cleanPDF(content);
      default:
        return this.cleanGeneric(content);
    }
  }

  private cleanMarkdown(content: string): string {
    // Remove markdown syntax while preserving content (mirrors the Phase 2.1 cleaner)
    return this.cleanGeneric(
      content
        .replace(/^#{1,6}\s+/gm, '')
        .replace(/\*\*?([^*]+)\*\*?/g, '$1')
        .replace(/```[\s\S]*?```/g, '')
        .replace(/^(---+|\|.+\|)$/gm, '')
    );
  }

  private cleanPDF(content: string): string {
    // Remove PDF artifacts: bare page numbers and similar line-level debris
    return this.cleanGeneric(content.replace(/^\s*\d{1,4}\s*$/gm, ''));
  }

  private cleanGeneric(content: string): string {
    // Basic cleanup: strip control characters, collapse runs of spaces/tabs
    return content
      .replace(/[\x00-\x08\x0b\x0c\x0e-\x1f]/g, '')
      .replace(/[ \t]+/g, ' ')
      .trim();
  }
}
```
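Root cause 1 also notes the missing sentence boundary detection (Task 3), and the preprocessor is its natural home: splitting cleaned text into sentences keeps n-gram windows from straddling sentence breaks. A minimal sketch using `Intl.Segmenter`, which ships with Node 16+:
```typescript
// Split cleaned text into sentences so candidate windows never cross boundaries.
function splitSentences(text: string): string[] {
  const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
  return Array.from(segmenter.segment(text), s => s.segment.trim()).filter(Boolean);
}
```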
#### 3.2 Enhanced N-gram Extraction
**Modify `src/domain/semantic/algorithms/ngram-cosine-extractor.ts`:**
```typescript
export interface EnhancedExtractionOptions extends NGramExtractionOptions {
  fileType?: string; // drives format-specific preprocessing
  structuredCandidates?: {
    metadata?: string[];
    headers?: string[];
    entities?: string[];
    emphasized?: string[];
  };
  contentZones?: Array<{
    text: string;
    type: string;
    weight: number;
  }>;
}

async extractKeyPhrases(
  text: string,
  docEmbedding?: Float32Array,
  options: Partial<EnhancedExtractionOptions> = {}
): Promise<string[]> {
  const opts = { ...this.defaultOptions, ...options }; // merge with extractor defaults

  // 1. Preprocess text based on format
  const preprocessor = new TextPreprocessor(opts.fileType || '.txt');
  const cleanText = preprocessor.preprocess(text);

  // 2. Start with high-priority structured candidates
  const priorityCandidates: Array<{ text: string; weight: number }> = [];
  if (opts.structuredCandidates?.metadata) {
    priorityCandidates.push(
      ...opts.structuredCandidates.metadata.map(k => ({ text: k, weight: 1.0 }))
    );
  }
  if (opts.structuredCandidates?.headers) {
    priorityCandidates.push(
      ...opts.structuredCandidates.headers.map(h => ({ text: h, weight: 0.9 }))
    );
  }

  // 3. Extract n-grams from clean text
  const ngrams = extractNGrams(cleanText, opts.ngramRange[0], opts.ngramRange[1]);
  const candidates = filterCandidates(ngrams);

  // 4. Combine priority candidates with n-gram candidates
  const allCandidates = [
    ...priorityCandidates,
    ...candidates.map(c => ({ text: c, weight: 0.4 }))
  ];

  // 5. Score with weights applied
  return this.scoreWeightedCandidates(allCandidates, docEmbedding, opts);
}
```
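`scoreWeightedCandidates` is new. A sketch of the intended blend, assuming the extractor exposes an `embed` method and a `cosineSimilarity` helper, and that `topN` exists on the options (all three are assumptions):
```typescript
// Sketch: blend structural weight with cosine similarity to the document.
private async scoreWeightedCandidates(
  candidates: Array<{ text: string; weight: number }>,
  docEmbedding: Float32Array | undefined,
  opts: Partial<EnhancedExtractionOptions>
): Promise<string[]> {
  const scored = await Promise.all(candidates.map(async c => {
    const similarity = docEmbedding
      ? cosineSimilarity(await this.embed(c.text), docEmbedding) // assumed helpers
      : 0.5; // neutral score when no document embedding is available
    // Structural weight and semantic similarity split the final score evenly.
    return { text: c.text, score: 0.5 * c.weight + 0.5 * similarity };
  }));
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, opts.topN ?? 30)
    .map(c => c.text);
}
```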
### Phase 4: Enhanced Document Keyword Scoring
**Modify `src/domain/semantic/document-keyword-scorer.ts`:**
```typescript
export interface WeightedKeywordCandidate extends DocumentKeywordCandidate {
  sourceType: 'metadata' | 'header' | 'entity' | 'emphasized' | 'ngram';
  sourceWeight: number;
}

export interface EnhancedScoringOptions extends KeywordScoringOptions {
  structuredCandidates?: {
    metadata?: string[];
    headers?: string[];
    entities?: string[];
    emphasized?: string[];
  };
}

scoreAndSelect(
  documentEmbedding: Float32Array,
  options: EnhancedScoringOptions = {}
): KeywordScoringResult {
  // 1. Add structured candidates with high weights
  if (options.structuredCandidates) {
    this.addStructuredCandidates(options.structuredCandidates);
  }

  // 2. Apply weighted scoring
  const weights = {
    metadata: 1.0,           // Author-defined keywords
    header: 0.9,             // Document headers
    entity: 0.8,             // Named entities (sheets, slides)
    emphasized: 0.7,         // Bold/italic text
    ngram: 0.4,              // Regular n-gram extraction
    documentSimilarity: 1.0, // Existing cosine-similarity component
    chunkAverage: 1.0,       // Existing chunk-average component
    ...options.weights
  };

  // 3. Calculate weighted final scores
  for (const candidate of this.candidates.values()) {
    const sourceWeight = weights[candidate.sourceType] || 0.4;
    candidate.finalScore =
      sourceWeight * 0.5 + // Source weight gets 50%
      weights.documentSimilarity * candidate.documentScore * 0.3 +
      weights.chunkAverage * candidate.avgChunkScore * 0.2;
  }

  // 4. Ensure metadata keywords always make the cut
  return this.selectWithMinimumMetadata(options);
}
```
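`addStructuredCandidates` and `selectWithMinimumMetadata` are new. The latter is what enforces the "metadata keywords always make the cut" rule; a sketch, with the exact `KeywordScoringResult` shape assumed:
```typescript
// Sketch: metadata keywords get guaranteed seats; the rest fill by score.
private selectWithMinimumMetadata(_options: EnhancedScoringOptions): KeywordScoringResult {
  const limit = 30; // existing top-30 convention
  const ranked = [...this.candidates.values()]
    .sort((a, b) => b.finalScore - a.finalScore);
  const metadata = ranked.filter(c => c.sourceType === 'metadata');
  const rest = ranked.filter(c => c.sourceType !== 'metadata');
  // Keep all metadata keywords even past the cap: the success criteria
  // require 100% metadata inclusion.
  const selected = [...metadata, ...rest].slice(0, Math.max(limit, metadata.length));
  return { keywords: selected }; // result shape assumed
}
```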
## Implementation Tasks
### Task 1: Enhanced ParsedContent Structure (2 hours)
- [ ] Update `ParsedContent` interface in types
- [ ] Update all parser methods to return structured candidates
- [ ] Add content zone extraction for each file type
### Task 2: Format-Specific Parser Enhancements (6 hours)
- [ ] **Markdown**: Header extraction, content cleaning (1.5 hours)
- [ ] **PDF**: Metadata keyword usage, header detection (1.5 hours)
- [ ] **Word**: HTML structure parsing, metadata extraction (1.5 hours)
- [ ] **Excel**: Column headers, sheet names (1 hour)
- [ ] **PowerPoint**: Slide titles, bullet points (0.5 hours)
### Task 3: Text Preprocessing Service (3 hours)
- [ ] Create `TextPreprocessor` class
- [ ] Implement format-specific cleaning methods
- [ ] Add sentence boundary detection
- [ ] Integrate with extraction service
### Task 4: Enhanced N-gram Extraction (4 hours)
- [ ] Add structured candidate support to `NGramCosineExtractor`
- [ ] Implement weighted candidate scoring
- [ ] Add better filtering for incomplete phrases
- [ ] Test with real document examples
### Task 5: Enhanced Document Keyword Scoring (3 hours)
- [ ] Add weighted scoring to `DocumentKeywordScorer`
- [ ] Implement minimum metadata keyword guarantee
- [ ] Add source type tracking
- [ ] Ensure diversity while preserving quality
### Task 6: Integration & Testing (4 hours)
- [ ] Update `IndexingOrchestrator` to pass structured data
- [ ] Test with all file types in test databases
- [ ] Compare before/after keyword quality
- [ ] Performance impact analysis
## Success Criteria
### Quality Metrics
- [ ] **Metadata Keywords**: 100% of PDF/Word metadata keywords included
- [ ] **Header Recognition**: 90%+ of document headers captured as keywords
- [ ] **Format Cleanup**: Zero markdown syntax in final keywords
- [ ] **Structural Relevance**: Keywords represent document structure, not fragments
### Performance Metrics
- [ ] **Processing Time**: <10% increase in indexing time
- [ ] **Memory Usage**: No significant memory impact
- [ ] **Database Size**: Minimal impact on storage
### Validation Tests
- [ ] **Technical Documents**: Capture API names, method names, technical terms
- [ ] **Business Documents**: Extract policy names, section headers, key concepts
- [ ] **Presentations**: Slide titles and main bullet points as keywords
- [ ] **Spreadsheets**: Sheet names and column headers as domain terminology
## File Changes Required
### Core Files to Modify
1. `src/types/index.ts` - Enhanced ParsedContent interface
2. `src/domain/files/parser.ts` - All parsing methods
3. `src/domain/semantic/extraction-service.ts` - Text preprocessing integration
4. `src/domain/semantic/algorithms/ngram-cosine-extractor.ts` - Structured candidates
5. `src/domain/semantic/document-keyword-scorer.ts` - Weighted scoring
6. `src/application/indexing/orchestrator.ts` - Data flow integration
### New Files to Create
1. `src/domain/semantic/text-preprocessor.ts` - Format-specific cleaning
2. `tests/unit/semantic/text-preprocessor.test.ts` - Preprocessing tests
3. `tests/integration/keyword-quality.test.ts` - End-to-end quality tests (sketched below)
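A sketch of what `keyword-quality.test.ts` might assert, mirroring the quality metrics above (the test runner, fixture files, and `indexAndExtract` helper are all hypothetical):
```typescript
import { describe, it, expect } from 'vitest'; // test runner assumed

describe('keyword quality', () => {
  it('contains no markdown syntax in final keywords', async () => {
    const keywords = await indexAndExtract('fixtures/lecture-script.md'); // hypothetical helper
    for (const kw of keywords) {
      expect(kw).not.toMatch(/[#*`|]|---/);
    }
  });

  it('includes every PDF metadata keyword', async () => {
    const keywords = await indexAndExtract('fixtures/tagged.pdf');
    for (const meta of ['onnx', 'embeddings']) { // keywords embedded in the fixture
      expect(keywords).toContain(meta);
    }
  });
});
```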
## Risk Assessment
### Technical Risks
- **Format Compatibility**: Ensure changes work across all file types
- **Performance Impact**: Monitor processing time with large documents
- **Edge Cases**: Handle malformed or unusual document structures
### Mitigation Strategies
- **Incremental Rollout**: Implement one format at a time
- **Fallback Logic**: Maintain existing extraction as backup
- **Comprehensive Testing**: Test with diverse real-world documents
## Expected Outcomes
### Immediate Wins (Phase 1-2)
- **50-70% improvement** from metadata keywords and header extraction
- **Elimination** of markdown formatting artifacts
- **Better domain terminology** from structured data
### Full Implementation (All Phases)
- **80-90% improvement** in keyword relevance and quality
- **Document structure representation** in keywords
- **Author intent preservation** through metadata usage
- **Format-aware extraction** that leverages document semantics
### Long-term Benefits
- **Better search relevance** due to higher quality keywords
- **Improved user experience** with more meaningful document discovery
- **Foundation for advanced features** like topic modeling and document clustering