MCP Power - Knowledge Search Server

# Feature Spec: Dataset Creation & Indexing Tools

**Feature ID**: 002-dataset-creation
**Status**: Draft
**Created**: 2025-11-02
**Owner**: Development Team

---

## 1. Overview

### Problem Statement

Currently, users must manually:

- Create FAISS indexes using Python scripts
- Generate embeddings with sentence-transformers
- Create manifest.json and metadata.json files
- Organize files in the correct directory structure

This creates a significant barrier to entry and makes the MCP server difficult to use.

### Solution Summary

Provide CLI tooling to automatically index documents and create ready-to-use datasets:

- `mcpower index` command to create datasets from files/folders
- Python utilities for embedding generation
- Automatic manifest and metadata generation
- Support for multiple document formats (txt, md, pdf, docx)

### Success Criteria

- Users can create a dataset with a single command
- Indexing process completes in <5 minutes for 1000 documents
- Clear progress feedback during indexing
- Generated datasets work seamlessly with existing search functionality
- Comprehensive documentation and examples

---

## 2. User Stories

### US-001: Index Local Documents

**As a** developer
**I want to** index a folder of markdown files
**So that** I can search them using the MCP server

**Acceptance Criteria:**

- Can run `mcpower index ./docs --dataset my-docs --name "My Docs"`
- Process recursively finds all supported files
- Generates embeddings for each document
- Creates manifest.json and metadata.json
- Shows progress bar during indexing
- Outputs dataset ready for use

### US-002: Choose Embedding Model

**As a** power user
**I want to** specify which embedding model to use
**So that** I can optimize for quality vs speed

**Acceptance Criteria:**

- Can specify model via `--model` flag
- Supports common models: all-MiniLM-L6-v2, all-mpnet-base-v2, etc.
- Shows model info and estimated time
- Defaults to balanced model (all-MiniLM-L6-v2)

### US-003: Update Existing Dataset

**As a** user
**I want to** add documents to an existing dataset
**So that** I can incrementally update my knowledge base

**Acceptance Criteria:**

- Can run `mcpower index ./new-docs --dataset existing-docs --append`
- Preserves existing documents
- Updates metadata with new documents
- Rebuilds FAISS index with all documents

### US-004: Document Chunking

**As a** user indexing large documents
**I want** automatic chunking of long documents
**So that** search results are more precise

**Acceptance Criteria:**

- Automatically chunks documents over 512 tokens
- Preserves context with overlapping chunks
- Tracks original document in metadata
- Configurable chunk size via `--chunk-size`

---

## 3. Technical Design

### 3.1 Architecture

```
┌─────────────────────────────────────────┐
│            mcpower index CLI            │
│         (src/commands/index.ts)         │
└────────────┬────────────────────────────┘
             │
             ├─→ Document Discovery
             │   - Recursive file finder
             │   - File type detection
             │   - Content extraction
             │
             ├─→ Python Indexing Bridge
             │   (python/indexer.py)
             │   - Embedding generation
             │   - FAISS index creation
             │   - Batch processing
             │
             └─→ Metadata Generation
                 - manifest.json
                 - metadata.json
                 - Dataset stats
```

### 3.2 CLI Interface

```bash
# Basic usage
mcpower index <source-path> --dataset <dataset-id>

# Full options
mcpower index <source-path> \
  --dataset <dataset-id> \
  --name <display-name> \
  --description <description> \
  --model <embedding-model> \
  --chunk-size <tokens> \
  --chunk-overlap <tokens> \
  --file-types <extensions> \
  --output <datasets-dir> \
  --append \
  --verbose

# Examples
mcpower index ./docs --dataset my-docs --name "My Documentation"
mcpower index ./notes --dataset notes --model all-mpnet-base-v2
mcpower index ./new-docs --dataset existing --append
```

### 3.3 Python Indexer

**File**: `python/indexer.py`

Key components:

- `DocumentLoader` - Extracts text from various formats
- `Chunker` - Splits documents into semantic chunks
- `EmbeddingGenerator` - Uses sentence-transformers
- `IndexBuilder` - Creates FAISS index
- `MetadataWriter` - Generates JSON files

```python
# Defer annotation evaluation: DatasetStats is not yet defined in this spec.
from __future__ import annotations

from pathlib import Path

from sentence_transformers import SentenceTransformer


class DatasetIndexer:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def index_documents(
        self,
        source_path: Path,
        dataset_id: str,
        output_dir: Path,
        chunk_size: int = 512,
        chunk_overlap: int = 50
    ) -> DatasetStats:
        """
        Index documents and create dataset.

        Returns:
            DatasetStats with document count, embedding dimensions, etc.
        """
        pass
```

### 3.4 File Format Support

**Priority 1 (MVP)**:

- `.txt` - Plain text
- `.md` - Markdown

**Priority 2**:

- `.pdf` - PDF documents (via PyPDF2)
- `.docx` - Word documents (via python-docx)
- `.html` - HTML pages (via BeautifulSoup)

**Priority 3**:

- `.rst` - ReStructuredText
- `.ipynb` - Jupyter notebooks
- `.json` - JSON documents

### 3.5 Data Structures

**Manifest Schema** (unchanged):

```typescript
interface DatasetManifest {
  id: string;
  name: string;
  description: string;
  index: string;
  metadata: string;
  defaultTopK: number;
  createdAt?: string;
  updatedAt?: string;
  documentCount?: number;
  embeddingDimensions?: number;
  model?: string;
}
```

**Enhanced Metadata Schema**:

```typescript
interface EnhancedMetadata {
  documents: Array<{
    id: string;
    title: string;
    path: string;
    text: string;
    chunk?: number;        // NEW: chunk index if document was split
    chunkTotal?: number;   // NEW: total chunks from this document
    originalDoc?: string;  // NEW: original doc ID if chunked
    fileType?: string;     // NEW: original file type
    size?: number;         // NEW: file size in bytes
    createdAt?: string;    // NEW: indexing timestamp
  }>;
  stats?: {                // NEW: dataset statistics
    totalDocuments: number;
    totalChunks: number;
    embeddingDimensions: number;
    model: string;
    indexedAt: string;
  };
}
```

---

## 4. Implementation Plan

### Phase 1: Basic Indexing (Priority: High)

**Goal**: Index text and markdown files

**Tasks**:

- T048: Create `src/commands/index.ts` CLI command
- T049: Implement `python/indexer.py` core functionality
- T050: Add document discovery (txt, md files)
- T051: Generate embeddings using sentence-transformers
- T052: Create FAISS index
- T053: Write manifest.json and metadata.json
- T054: Add progress feedback and logging
- T055: Unit tests for indexer components
- T056: Integration test: index sample docs
- T057: Documentation: indexing guide

**Estimated effort**: 2-3 days

### Phase 2: Document Chunking (Priority: Medium)

**Goal**: Handle large documents effectively

**Tasks**:

- T058: Implement chunking strategy
- T059: Add chunk overlap configuration
- T060: Track chunk relationships in metadata
- T061: Test with large documents (>10k tokens)
- T062: Update search results to show chunk context

**Estimated effort**: 1-2 days

### Phase 3: Extended Format Support (Priority: Medium)

**Goal**: Support PDF, DOCX, and HTML files

**Tasks**:

- T063: Add PDF text extraction (PyPDF2)
- T064: Add DOCX text extraction (python-docx)
- T065: Add HTML text extraction (BeautifulSoup)
- T066: Test with real-world documents
- T067: Document supported formats

**Estimated effort**: 1-2 days

### Phase 4: Dataset Management (Priority: Low)

**Goal**: Update and manage existing datasets

**Tasks**:

- T068: Implement `--append` mode
- T069: Add `mcpower dataset list` command
- T070: Add `mcpower dataset info <id>` command
- T071: Add `mcpower dataset delete <id>` command
- T072: Add `mcpower dataset validate <id>` command

**Estimated effort**: 1 day

---

## 5. Testing Strategy

### 5.1 Unit Tests

- Document loader for each file type
- Chunking algorithm with various sizes
- Embedding generation
- FAISS index creation
- Metadata generation

### 5.2 Integration Tests

- End-to-end indexing of sample dataset
- Verify indexed dataset works with search
- Test append mode
- Test large dataset (1000+ documents)

### 5.3 Performance Tests

- Index 1000 documents in <5 minutes
- Memory usage stays under 2GB
- Generated index loads in <1 second

### 5.4 Manual Testing

- Index real documentation (e.g., Next.js docs)
- Search indexed dataset from MCP client
- Verify results are relevant

---

## 6. Documentation Requirements

### 6.1 User Guide

- How to index your documents
- Choosing the right embedding model
- Chunking strategies for large documents
- Troubleshooting indexing issues

### 6.2 CLI Reference

- `mcpower index` command documentation
- All flags and options explained
- Common usage examples

### 6.3 Developer Guide

- How the indexing pipeline works
- Adding support for new file formats
- Customizing embedding models

---

## 7. Dependencies

### New Python Dependencies

```
sentence-transformers>=2.2.0
faiss-cpu>=1.7.4
typer>=0.9.0
rich>=13.0.0
PyPDF2>=3.0.0            # Phase 3
python-docx>=1.0.0       # Phase 3
beautifulsoup4>=4.12.0   # Phase 3
```

### New TypeScript Dependencies

```
cli-progress  # Progress bars
chalk         # Terminal colors
```

---

## 8. Security Considerations

### Input Validation

- Validate file paths to prevent directory traversal
- Limit maximum file size (default 50MB)
- Sanitize dataset IDs (alphanumeric + hyphens only)
- Validate file types by content, not just extension

### Resource Limits

- Maximum documents per batch (default 100)
- Memory limit for embedding generation
- Timeout for long-running operations
- Disk space checks before indexing

### Privacy

- No external API calls (all local processing)
- No telemetry or tracking
- User documents never leave machine

---

## 9. Open Questions

1. **Model storage**: Should we bundle a default model or require users to download?
   - **Recommendation**: Download on first use with clear prompt
2. **Incremental indexing**: Should we support updating individual documents?
   - **Recommendation**: Phase 4 feature, start with full rebuild
3. **Multi-language support**: Should we auto-detect language and use language-specific models?
   - **Recommendation**: Start with English, add language detection in future
4. **Cloud sources**: Should we support indexing from URLs or cloud storage?
   - **Recommendation**: Start with local files, add cloud support later

---

## 10. Future Enhancements

- Web crawler to index entire websites
- Git integration to index repository history
- Automatic re-indexing on file changes (watch mode)
- Dataset merging and splitting
- Embedding model fine-tuning utilities
- Multi-modal support (images, code)
- Distributed indexing for massive datasets

---

## 11. Success Metrics

- **Usability**: Users can create first dataset in <5 minutes
- **Performance**: Index 1000 docs in <5 minutes
- **Quality**: Search precision >0.8 on sample queries
- **Adoption**: >50% of users create their own datasets (vs using sample)
- **Support**: <10% of questions are about indexing issues

---

## 12. References

- [FAISS Documentation](https://github.com/facebookresearch/faiss/wiki)
- [Sentence Transformers](https://www.sbert.net/)
- [MCP Protocol Spec](https://modelcontextprotocol.io/)
- Original PLAN.md Section 2 (Dataset Registry Contract)
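
## Appendix: Chunking Sketch (Illustrative)

The overlapping-chunk behaviour described in US-004 and the `chunk_size` / `chunk_overlap` parameters of `index_documents` could look roughly like the sketch below. It is not the spec's `Chunker`: the function names `chunk_tokens` and `chunk_records`, the `doc-id#n` chunk ID format, and whitespace tokenization are all assumptions made for illustration (a real implementation would likely count tokens with the embedding model's tokenizer). The record keys `chunk`, `chunkTotal`, and `originalDoc` follow the Enhanced Metadata Schema.

```python
from typing import Dict, List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 512, chunk_overlap: int = 50
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens, so context that
    straddles a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_records(
    doc_id: str, text: str, chunk_size: int = 512, chunk_overlap: int = 50
) -> List[Dict[str, object]]:
    """Build metadata-style records for each chunk of one document.

    Whitespace splitting stands in for real tokenization (assumption).
    The "<doc_id>#<n>" ID format is hypothetical.
    """
    chunks = chunk_tokens(text.split(), chunk_size, chunk_overlap)
    return [
        {
            "id": f"{doc_id}#{i}",
            "originalDoc": doc_id,
            "chunk": i,
            "chunkTotal": len(chunks),
            "text": " ".join(chunk),
        }
        for i, chunk in enumerate(chunks)
    ]
```

With the defaults, each new window starts 462 tokens after the previous one, so a 1000-token document yields three chunks, and a document of 512 tokens or fewer stays in a single chunk.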

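
## Appendix: Input Validation Sketch (Illustrative)

The input-validation rules in Section 8 (dataset IDs limited to alphanumerics and hyphens, directory-traversal prevention, and a 50MB default file-size limit) might be checked along these lines. This is a minimal sketch under stated assumptions: the helper names `is_valid_dataset_id`, `resolve_inside`, and `is_within_size_limit` are hypothetical, and "50MB" is interpreted here as 50 * 1024 * 1024 bytes, which the spec does not specify.

```python
import re
from pathlib import Path

# Alphanumeric characters and hyphens only, per Section 8.
DATASET_ID_PATTERN = re.compile(r"[A-Za-z0-9-]+")

# Assumption: the spec's "50MB" default is treated as 50 MiB.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the ID contains only letters, digits, and hyphens."""
    return DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


def resolve_inside(root: Path, candidate: Path) -> Path:
    """Resolve candidate relative to root and reject paths that escape it."""
    root_resolved = root.resolve()
    resolved = (root_resolved / candidate).resolve()
    if resolved != root_resolved and root_resolved not in resolved.parents:
        raise ValueError(f"path escapes {root_resolved}: {candidate}")
    return resolved


def is_within_size_limit(
    size_bytes: int, limit: int = MAX_FILE_SIZE_BYTES
) -> bool:
    """Return True if a file of size_bytes is within the configured limit."""
    return 0 <= size_bytes <= limit
```

Resolving both paths before comparing them means `..` segments and absolute paths are normalized first, so a candidate such as `../secrets.txt` is rejected rather than silently joined onto the datasets directory.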