MCP Codebase Index

Overview Schema Related Servers Score Discussions

VECTOR_VISUALIZATION.md•22.7 KiB

# Vector Visualization Guide > **Explore your codebase in 2D/3D space** - See how your code is semantically organized using AI embeddings and UMAP dimensionality reduction. --- ## 📖 Table of Contents - [Overview](#overview) - [Quick Start](#quick-start) - [Visualization Tools](#visualization-tools) - [visualize_collection](#visualize_collection) - [visualize_query](#visualize_query) - [export_visualization_html](#export_visualization_html) - [Understanding the Visualization](#understanding-the-visualization) - [Use Cases](#use-cases) - [Technical Details](#technical-details) - [Troubleshooting](#troubleshooting) - [Best Practices](#best-practices) --- ## Overview ### What is Vector Visualization? Vector visualization transforms your codebase's **768-dimensional embeddings** into **2D or 3D space** using UMAP (Uniform Manifold Approximation and Projection). This allows you to: - **See semantic relationships** - Code with similar meaning clusters together - **Understand codebase structure** - Identify modules, patterns, and outliers - **Debug search results** - Visualize why certain results were retrieved - **Explore code organization** - Find related code visually ### How It Works ``` Your Code Files ↓ Gemini Embeddings (768 dimensions) ↓ Qdrant Vector Database ↓ UMAP Dimensionality Reduction ↓ 2D/3D Interactive Visualization ``` **Key Benefits:** - 🎨 **Interactive** - Click, zoom, pan, hover for details - 🔍 **Semantic** - Similar code appears close together - 📊 **Insightful** - Understand your codebase at a glance - 💾 **Exportable** - Save as standalone HTML files --- ## Quick Start ### 1. Visualize Your Entire Codebase Ask GitHub Copilot: ``` "Visualize my codebase" "Show me how my code is organized" "Create a 2D visualization of the vector space" ``` **Result:** You'll see clusters of similar code grouped by functionality.  ### 2. Visualize Search Results Ask GitHub Copilot: ``` "Visualize authentication code" "Show me where error handling is located" "Visualize query: database connections" ``` **Result:** You'll see the query point (red) and matching code (green) in context.  ### 3. Export as Interactive HTML Ask GitHub Copilot: ``` "Export codebase visualization as HTML" "Create an interactive visualization file" "Save visualization with clustering enabled" ``` **Result:** Standalone HTML file you can open in any browser.  --- ## Visualization Tools ### `visualize_collection` **Purpose:** Visualize the entire vector database to understand overall codebase structure. **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `dimensions` | 2 or 3 | 2 | Number of dimensions for visualization | | `enableClustering` | boolean | true | Group similar code into clusters | | `maxVectors` | number | 1000 | Maximum vectors to visualize (100-5000) | | `format` | string | "summary" | Output format: "summary", "plotly", "json" | **Example Usage:** ``` User: "Visualize my codebase with clustering" → Uses: dimensions=2, enableClustering=true, maxVectors=1000, format="summary" User: "Show me a 3D visualization" → Uses: dimensions=3, enableClustering=true, maxVectors=1000, format="summary" User: "Visualize 2000 vectors without clusters" → Uses: dimensions=2, enableClustering=false, maxVectors=2000, format="summary" ``` **Output Formats:** 1. **"summary"** (Recommended for AI interpretation) - Human-readable text description - Cluster analysis with top terms - Statistics and metadata - Performance metrics 2. **"plotly"** (For interactive visualization) - Plotly JSON format - Can be rendered in browser - Interactive hover, zoom, pan 3. **"json"** (For programmatic use) - Complete structured data - All points, clusters, metadata - For custom processing **Sample Output (Summary Format):** ``` 📊 CODEBASE VISUALIZATION SUMMARY Overview: - Total Vectors: 847 - Dimensions: 768 → 2D (UMAP reduction) - Clustering: Enabled (5 clusters detected) 🎯 Semantic Clusters: Cluster 1 (234 vectors, 27.6%) - "API Controllers & Routes" Top terms: controller, route, api, endpoint, request Representative file: src/api/controllers/userController.ts Insight: HTTP endpoint handlers and routing logic Cluster 2 (189 vectors, 22.3%) - "Database Models & Schemas" Top terms: model, schema, database, query, entity Representative file: src/models/User.ts Insight: Data models and database interactions Cluster 3 (156 vectors, 18.4%) - "Authentication & Security" Top terms: auth, token, jwt, password, verify Representative file: src/auth/authService.ts Insight: User authentication and authorization Cluster 4 (142 vectors, 16.8%) - "Utility Functions" Top terms: util, helper, format, validate, parse Representative file: src/utils/helpers.ts Insight: Shared utility and helper functions Cluster 5 (126 vectors, 14.9%) - "Test Suites" Top terms: test, expect, mock, describe, it Representative file: tests/unit/userController.test.ts Insight: Unit and integration tests 📈 Performance: - UMAP reduction: 2.3s - Clustering: 0.4s - Total time: 2.8s ``` ### `visualize_query` **Purpose:** Visualize a search query and its results to understand why certain code was retrieved. **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `query` | string | (required) | The search query to visualize | | `dimensions` | 2 or 3 | 2 | Number of dimensions | | `topK` | number | 10 | Number of top results to highlight (1-50) | | `enableClustering` | boolean | true | Group similar code into clusters | | `maxVectors` | number | 500 | Background vectors to show (100-2000) | | `format` | string | "summary" | Output format: "summary", "plotly", "json" | **Example Usage:** ``` User: "Visualize authentication logic" → Uses: query="authentication logic", topK=10, format="summary" User: "Show me where database code is located" → Uses: query="database code", topK=10, format="summary" User: "Visualize error handling with 20 results" → Uses: query="error handling", topK=20, format="summary" ``` **How to Interpret:** - **Red Diamond (◆)**: Your query point in the embedding space - **Green Points (●)**: Retrieved/relevant code chunks - **Gray Points (●)**: Background codebase for context - **Distance**: Closer points = more semantically similar **Sample Output (Summary Format):** ``` 🔍 QUERY VISUALIZATION SUMMARY Query: "authentication logic" Retrieved: 10 most similar code chunks 📍 Query Position: - Embedded in 768D space - Reduced to 2D coordinates: (12.4, -8.7) - Located in Cluster 3: "Authentication & Security" 🎯 Top 10 Retrieved Results: 1. src/auth/authService.ts:45-78 (Similarity: 95.2%) Function: validateToken Distance: 0.12 Why retrieved: Direct authentication token validation 2. src/auth/authMiddleware.ts:12-34 (Similarity: 92.8%) Function: requireAuth Distance: 0.18 Why retrieved: Authentication middleware check 3. src/api/controllers/authController.ts:89-124 (Similarity: 91.3%) Function: login Distance: 0.21 Why retrieved: User login authentication flow [... 7 more results ...] 🎨 Cluster Distribution: - Cluster 3 (Authentication): 7 results (70%) - Cluster 1 (API Controllers): 2 results (20%) - Cluster 4 (Utilities): 1 result (10%) 💡 Insight: Most results cluster around authentication code (Cluster 3), with some overlap to API controllers. The query successfully targeted the auth module. 📈 Performance: - Query embedding: 0.3s - Vector search: 0.1s - UMAP reduction: 1.8s - Total time: 2.2s ``` ### `export_visualization_html` **Purpose:** Export visualization as a standalone interactive HTML file. **Parameters:** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `dimensions` | 2 or 3 | 2 | Number of dimensions | | `enableClustering` | boolean | true | Group similar code into clusters | | `maxVectors` | number | 1000 | Maximum vectors to visualize | | `outputPath` | string | (auto) | Custom output path (optional) | **Example Usage:** ``` User: "Export visualization as HTML" → Auto-generates: codebase-viz-<collection>-<timestamp>.html User: "Create interactive visualization file" → Same as above User: "Export to /path/to/viz.html" → Saves to specified path ``` **Generated HTML Features:** - ✅ **Standalone** - No internet required, embeds Plotly.js - ✅ **Interactive** - Hover, zoom, pan, click - ✅ **Modern UI** - Gradient design, collapsible sidebar - ✅ **Cluster Info** - Click clusters to highlight - ✅ **Vector Details** - Click points for code info - ✅ **Export Tools** - Save as PNG, reset view - ✅ **Responsive** - Works on all screen sizes **HTML Structure:** ```html <!DOCTYPE html> <html> <head> <title>Codebase Visualization</title> <script src="plotly-2.35.2.min.js"></script> <style>/* Modern gradient UI */</style> </head> <body> <div class="header"> <h1>🔍 Codebase Vector Visualization</h1> <div class="stats">...</div> </div> <div class="content"> <div id="visualization"></div> <div class="sidebar"> <h3>📊 Semantic Clusters</h3>  </div> </div> <div class="guide"> <h3>💡 How to Use</h3>  </div> </body> </html> ``` **Opening the File:** ```bash # macOS open codebase-viz-myproject-2025-11-18T10-30-00.html # Linux/WSL xdg-open codebase-viz-myproject-2025-11-18T10-30-00.html # Windows start codebase-viz-myproject-2025-11-18T10-30-00.html ``` --- ## Understanding the Visualization ### What Do the Colors Mean? **In Collection Visualization:** - Each cluster has a unique color (blue, orange, green, red, purple, etc.) - Points in the same cluster share similar semantic meaning - Outliers (unclustered) are shown in gray ### What Do Clusters Represent? Clusters are groups of code chunks with similar semantic meaning, typically representing: - **Functional modules** (e.g., authentication, database, API) - **Code patterns** (e.g., controllers, models, tests) - **Programming paradigms** (e.g., OOP classes, functional utilities) - **Technology stacks** (e.g., React components, Node.js services) ### Why Is My Code Positioned Here? Position in the visualization reflects **semantic similarity**: - **Close together** = Similar in meaning/purpose - **Far apart** = Different functionality - **In same cluster** = Part of the same module/pattern - **Between clusters** = Shared concepts/utilities ### Reading the Distances Distance metrics in the visualization: ``` Distance < 0.3 → Very similar (same function family) Distance 0.3-0.6 → Related (same module) Distance 0.6-1.0 → Loosely related (some overlap) Distance > 1.0 → Unrelated (different domains) ``` --- ## Use Cases ### 1. Understanding Codebase Architecture **Scenario:** You joined a new project and want to understand its structure. **Steps:** 1. Ask Copilot: *"Visualize my codebase with clustering"* 2. Look at the clusters - they represent major modules 3. Check cluster labels for module descriptions 4. Identify the core vs. utility code **Example Insight:** ``` "Your codebase has 5 main modules: 1. API layer (28%) 2. Database layer (23%) 3. Authentication (19%) 4. Business logic (18%) 5. Tests (12%) The API and database layers are closely connected, suggesting tight integration." ```  ### 2. Finding Related Code **Scenario:** You need to modify authentication but aren't sure where all related code is. **Steps:** 1. Ask Copilot: *"Visualize query: authentication"* 2. Look at the green highlighted points 3. Check which clusters they belong to 4. Navigate to those files **Example Insight:** ``` "Authentication code is primarily in Cluster 3, but also touches Cluster 1 (API endpoints) and Cluster 4 (utilities). You'll need to update files in all three areas." ``` ### 3. Debugging Search Results **Scenario:** Search returned unexpected results. **Steps:** 1. Ask Copilot: *"Visualize query: [your search]"* 2. Check where the query point (red) landed 3. See which clusters the results (green) came from 4. Understand why those results were retrieved **Example Insight:** ``` "Your query 'user profile' landed between Cluster 1 (API) and Cluster 2 (Database), which explains why results came from both layers. To focus on API only, try 'user profile API endpoint'." ``` ### 4. Identifying Outliers **Scenario:** Find code that doesn't fit the project structure. **Steps:** 1. Visualize codebase with clustering 2. Look for gray points (unclustered) 3. Check points far from any cluster 4. Review those files for refactoring **Example Insight:** ``` "File: legacy/oldAuth.js is an outlier, positioned far from the main authentication cluster. Consider migrating to the new auth system." ``` ### 5. Validating Refactoring **Scenario:** You refactored code and want to verify it's properly organized. **Steps:** 1. Visualize before refactoring (save HTML) 2. Perform refactoring 3. Visualize after refactoring 4. Compare the cluster organization **Example Insight:** ``` Before: Authentication code scattered across 3 clusters After: Authentication code unified in 1 tight cluster ✅ Refactoring improved code cohesion ``` ### 6. Onboarding New Developers **Scenario:** Help new team members understand the codebase. **Steps:** 1. Export visualization as HTML 2. Share with team member 3. Use clusters as a visual guide 4. Explain each cluster's purpose **Example Insight:** ``` "Here's our codebase structure: - Blue cluster: Frontend components - Orange cluster: API services - Green cluster: Database models - Red cluster: Authentication - Purple cluster: Tests Start with the blue cluster to understand the UI." ``` --- ## Technical Details ### UMAP Algorithm **What is UMAP?** UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that: - Preserves both **local** and **global** structure - Faster than t-SNE - Better at preserving distances - Ideal for high-dimensional embeddings (768D → 2D/3D) **Parameters Used:** ```typescript { nNeighbors: 15, // Balance local/global structure minDist: 0.1, // Minimum distance between points spread: 1.0, // Scale of embedded points nComponents: 2 or 3 // Output dimensions } ``` **Why These Parameters?** - `nNeighbors=15`: Good balance for code similarity - `minDist=0.1`: Allows tight clusters without overlap - `spread=1.0`: Natural scaling for visualization ### K-Means Clustering **Algorithm:** Standard k-means with automatic cluster detection **Cluster Count Detection:** ```typescript // Use elbow method to find optimal k const maxClusters = Math.min(10, Math.floor(points.length / 50)); const k = findOptimalK(points, maxClusters); ``` **Cluster Labeling:** - Extract top terms from code chunks in each cluster - Use term frequency to identify cluster themes - Generate descriptive labels automatically ### Vector Sampling For large collections (>5000 vectors), we use smart sampling: **Strategy:** ```typescript if (totalVectors > maxVectors) { // Sample evenly across the collection const step = Math.floor(totalVectors / maxVectors); vectors = vectors.filter((_, i) => i % step === 0); } ``` **Why Sample?** - UMAP performance: O(n²) for large datasets - Visualization clarity: Too many points = cluttered - Balance: Show overall structure without overwhelming ### Performance Metrics **Typical Processing Times:** | Operation | 100 vectors | 1000 vectors | 5000 vectors | |-----------|-------------|--------------|--------------| | UMAP 2D | 0.5s | 2.5s | 15s | | UMAP 3D | 0.8s | 3.5s | 22s | | Clustering | 0.1s | 0.5s | 2.5s | | Rendering | 0.2s | 0.8s | 3.5s | | **Total** | **0.8s** | **4.3s** | **24s** | **Optimization Tips:** - Use 2D instead of 3D (40% faster) - Limit maxVectors to 1000-2000 for best performance - Enable clustering for better organization ### Data Flow ``` 1. Fetch vectors from Qdrant ↓ 2. Sample if needed (>maxVectors) ↓ 3. UMAP dimensionality reduction (768D → 2D/3D) ↓ 4. K-means clustering (optional) ↓ 5. Generate Plotly traces ↓ 6. Export to format (summary/plotly/json) ↓ 7. Return to user via MCP ``` --- ## Troubleshooting ### "umap-js not available" **Problem:** UMAP library not installed. **Solution:** ```bash npm install umap-js ``` **Why it happens:** UMAP is an optional dependency for visualization. ### Visualization Takes Too Long **Problem:** UMAP processing is slow for large collections. **Solution:** - Reduce `maxVectors` to 500-1000 - Use 2D instead of 3D - Disable clustering if not needed **Example:** ``` User: "Visualize 500 vectors in 2D without clustering" → Much faster than default 1000 vectors with clustering ``` ### Clusters Don't Make Sense **Problem:** Cluster labels are generic or unclear. **Possible Causes:** 1. **Too few vectors per cluster** - Increase `maxVectors` 2. **Code is very similar** - Normal for small projects 3. **Too many clusters** - Algorithm splits too finely **Solution:** - Try different `maxVectors` values - Review actual cluster members to understand grouping - Consider that similar code will cluster together (expected) ### Query Point Not Where Expected **Problem:** Query lands in unexpected location. **Why it happens:** - Query embedding reflects semantic meaning - May land between clusters if query is ambiguous - Multiple valid interpretations possible **Solution:** - Make query more specific - Check retrieved results - they're still ranked by relevance - Use this insight to refine your search query ### HTML Export Opens But Doesn't Render **Problem:** HTML file opens but visualization is blank. **Possible Causes:** 1. **Browser security** - Local file restrictions 2. **JavaScript disabled** - Required for Plotly 3. **File corruption** - Download interrupted **Solution:** ```bash # Serve via local HTTP server cd /path/to/html/directory python3 -m http.server 8000 # Open in browser open http://localhost:8000/codebase-viz-xxx.html ``` --- ## Best Practices ### 1. Start with Default Settings **Recommended first visualization:** ``` "Visualize my codebase" ``` This uses optimal defaults (2D, 1000 vectors, clustering enabled). ### 2. Use Summary Format for Understanding **For AI interpretation:** ``` "Visualize my codebase" → Returns summary text ``` **For interactive exploration:** ``` "Export visualization as HTML" → Returns interactive HTML ``` ### 3. Adjust maxVectors Based on Project Size **Project Size Guidelines:** | Project Size | Files | Chunks | Recommended maxVectors | |--------------|-------|--------|------------------------| | Small | <50 | <500 | 500 | | Medium | 50-200 | 500-2000 | 1000 | | Large | 200-1000 | 2000-10000 | 2000 | | Very Large | >1000 | >10000 | 3000 | ### 4. Use Query Visualization for Specific Questions **Good queries:** ``` ✅ "Visualize authentication logic" ✅ "Show me database connection code" ✅ "Visualize error handling patterns" ``` **Less useful queries:** ``` ❌ "Visualize code" → Too vague ❌ "Show me everything" → Use collection visualization ❌ "Visualize file.ts" → Use search instead ``` ### 5. Export HTML for Presentations **Use cases:** - Architecture reviews - Team onboarding - Code audits - Documentation **Tips:** - Use descriptive output paths - Include timestamp in filename - Save before major refactoring (for comparison) ### 6. Combine with Other Tools **Workflow example:** ``` 1. Visualize codebase → Identify modules 2. Search specific module → Find relevant code 3. Visualize query → Understand why retrieved 4. Implement change 5. Visualize again → Verify organization ``` ### 7. Interpret Clusters Contextually **Remember:** - Clusters reflect semantic similarity, not file structure - Cross-cutting concerns (logging, error handling) may scatter - Utility code often forms separate clusters - Test code typically clusters separately ### 8. Use 3D Sparingly **When to use 3D:** - Very large codebases with many distinct modules - When 2D visualization has too much overlap - For presentations/demos (more impressive) **When to use 2D:** - Most cases (easier to interpret) - Faster processing - Better for screenshots/documentation ### 9. Regular Visualization Reviews **Recommended schedule:** - After major refactoring - Monthly for active projects - When onboarding new developers - During architecture reviews ### 10. Save Visualization History **Track evolution:** ```bash codebase-viz-2025-01-15.html # Before refactoring codebase-viz-2025-02-15.html # After refactoring codebase-viz-2025-03-15.html # After feature additions ``` Compare over time to see: - Architecture improvements - Code organization changes - Module growth patterns --- ## Advanced Topics ### Custom Clustering While not directly configurable via MCP tools, the clustering algorithm automatically adapts: - **Small collections (<100 vectors)**: 2-3 clusters - **Medium collections (100-1000)**: 3-7 clusters - **Large collections (>1000)**: 5-10 clusters ### Interpreting Outliers Outliers (unclustered points) may indicate: 1. **Unique functionality** - Specialized code not fitting patterns 2. **Legacy code** - Old code using different patterns 3. **External dependencies** - Third-party integrations 4. **Configuration files** - Non-code text files ### Combining Multiple Queries **Workflow for complex understanding:** ``` 1. "Visualize authentication" 2. "Visualize database" 3. "Visualize the overlap between auth and database" ``` This reveals how modules interact. ### Performance Tuning **For large codebases (>10,000 vectors):** ```javascript // Recommended approach maxVectors: 2000, // Sample to 2000 dimensions: 2, // Faster than 3D enableClustering: true // Still useful for organization ``` **For real-time updates:** ```javascript maxVectors: 500, // Quick processing dimensions: 2, enableClustering: false // Skip for speed ``` --- ## Related Documentation - **[Main README](../../README.md)** - Project overview - **[Prompt Enhancement Guide](./PROMPT_ENHANCEMENT_GUIDE.md)** - Query improvement - **[Testing Guide](./TEST_SEARCH.md)** - Search functionality - **[Source Code Structure](../../src/README.md)** - Code organization - **[Changelog](../CHANGELOG.md)** - Version history --- ## Feedback and Support Have questions or suggestions about vector visualization? - **Issues:** [GitHub Issues](https://github.com/NgoTaiCo/mcp-codebase-index/issues) - **Discussions:** [GitHub Discussions](https://github.com/NgoTaiCo/mcp-codebase-index/discussions) - **Email:** ngotaico.flutter@gmail.com --- **Last Updated:** v1.5.4-beta.19 (November 18, 2025)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/NgoTaiCo/mcp-codebase-index'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

VECTOR_VISUALIZATION.md•22.7 KiB