Semantic Search MCP

weighted-embedding-strategy.md•4.13 KiB

# Weighted Embedding Strategy ## Problem Statement The semantic search indexer creates embeddings from entire markdown file content. This leads to suboptimal search results because: 1. **Signal dilution**: A 2000-word document might have only 10-20 words (filename, title, tags) that truly describe its topic. These meaningful words get drowned by body content that may mention many tangential topics. 2. **Relevant pages rank low**: Files with highly descriptive filenames and metadata often don't appear at the top of results because their semantic signal is diluted. 3. **Long documents dominate unpredictably**: More words create more "semantic surface area" but also more noise. ## Solution: Weighted Text Concatenation Instead of embedding raw file content, we construct a weighted text representation by repeating important components before creating the embedding. ### How It Works When SentenceTransformers encodes text, words appearing multiple times have proportionally more influence on the resulting vector direction. By repeating high-value components, we shift the embedding toward the true topic. ### Input Construction ``` {filename} {filename} {filename} {metadata_title} {metadata_title} {metadata_title} {metadata_tags} {metadata_tags} {first_heading} {first_heading} {body_content_truncated} ``` ### Component Weights | Component | Weight | Rationale | |-----------|--------|-----------| | Filename (without extension) | 3x | Often the most descriptive identifier; users name files meaningfully | | Metadata title | 3x | Explicit topic declaration in frontmatter | | Metadata tags/aliases | 2x | Explicit categorization by the author | | First heading (# Title) | 2x | Usually the document title, reinforces filename | | Body (first 500 words) | 1x | Supporting context, truncated to prevent dilution | ### Example **File**: `API-Authentication-Guide.md` **Frontmatter**: ```yaml --- title: API Authentication Guide tags: [api, authentication, security, oauth] aliases: [Auth Guide] --- ``` **Content**: ```markdown # API Authentication Guide This guide covers how to authenticate with our API... (2000 more words) ``` **Weighted Text for Embedding**: ``` API Authentication Guide API Authentication Guide API Authentication Guide API Authentication Guide API Authentication Guide API Authentication Guide api authentication security oauth api authentication security oauth API Authentication Guide API Authentication Guide This guide covers how to authenticate with our API... (first 500 words of body) ``` ## Benefits 1. **Single embedding per file**: No increase in vector storage or search complexity 2. **No change to search logic**: Same FAISS index, same search API 3. **Tunable**: Weights can be adjusted based on empirical results 4. **Short docs benefit naturally**: Higher signal-to-noise ratio 5. **Long docs don't dominate**: Body truncation prevents dilution ## Trade-offs 1. **Requires re-indexing**: Existing indexes must be rebuilt after implementation 2. **Metadata parsing overhead**: Small increase in indexing time for YAML parsing 3. **Weight tuning**: Optimal weights may vary by corpus; defaults are a starting point ## Implementation Notes ### Filename Processing - Remove file extension (`.md`) - Replace `-` and `_` with spaces - Example: `API-Authentication-Guide.md` → `API Authentication Guide` ### Metadata Extraction Parse YAML frontmatter (content between `---` markers) and extract: - `title`: String value - `tags`: List of strings - `aliases`: List of strings ### Body Truncation - Remove frontmatter before counting - Take first 500 words (split on whitespace) - This captures introduction/summary which typically contains key information ### Heading Extraction - Find first line starting with `# ` (H1 heading) - Strip the `# ` prefix - If no H1 found, skip this component ## Future Enhancements 1. **Configurable weights**: Allow users to tune weights via configuration 2. **Section-aware indexing**: Create separate entries for major sections in long documents 3. **Keyword boosting**: Post-search score adjustment for exact matches in filename/title 4. **Length normalization**: Explicit scoring penalty for very long documents

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bborbe/semantic-search-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

weighted-embedding-strategy.md•4.13 KiB