# IndexFoundry-MCP

**Deterministic Vector Index Factory** - An MCP server for automated, auditable vector database creation from any content source, with deployable project-based workflows.

> Tools don't think, they act.

Every tool in this server is:

- **Deterministic**: Same inputs → same outputs
- **Idempotent**: Re-running produces identical artifacts (unless `force: true`)
- **Auditable**: Every operation produces manifests, hashes, and logs
- **Composable**: Tools can be run independently or chained

## Architecture

IndexFoundry provides two complementary workflows:

### 1. Run-Based Pipeline (Fine-Grained Control)

Individual pipeline runs with isolated artifacts, suitable for experimentation and detailed auditing.

### 2. Project-Based Workflow (Deployable RAG Applications)

Self-contained projects that generate deployment-ready repositories with an MCP server, Dockerfile, and Railway configuration.

## Pipeline Phases (Run-Based)

```
Connect → Extract → Normalize → Index → Serve
   ↓         ↓          ↓          ↓       ↓
  raw/   extracted/ normalized/ indexed/ served/
```

### Phase 1: Connect

Fetch content from URLs, sitemaps, folders, or PDFs. Every artifact gets a content hash.
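For intuition, here is how content addressing might look in practice. The `hashArtifact` helper below is illustrative only (it is not part of the server's API); it just shows how the `raw/<sha256>.<ext>` naming used throughout this README can be derived:

```typescript
// Illustrative only: deriving a content-addressed artifact path in the
// raw/<sha256>.<ext> style used in this README. hashArtifact is a
// hypothetical helper, not part of IndexFoundry's API.
import { createHash } from "node:crypto";

function hashArtifact(bytes: Buffer, ext: string): string {
  const sha256 = createHash("sha256").update(bytes).digest("hex");
  return `raw/${sha256}.${ext}`;
}
```

Identical bytes always map to the same path, so re-fetching an unchanged document is a no-op (idempotency), and every artifact can be verified against its hash (auditability).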
### Phase 2: Extract

Convert raw bytes to text using pinned extractors (pdfminer, cheerio, etc.).

### Phase 3: Normalize

Chunk text deterministically, enrich metadata (no LLM), and deduplicate.

### Phase 4: Index

Generate embeddings with a pinned model and upsert them to a vector DB.

### Phase 5: Serve

Generate an OpenAPI spec and optionally start a retrieval API.

## Quick Start

```bash
# Install dependencies
npm install

# Build
npm run build

# Run on stdio (for Claude Desktop, Cline, etc.)
npm start

# Run as an HTTP server
npm run start:http
```

## Workflow Options

### Option 1: Run-Based Pipeline (Detailed Control)

Use individual pipeline tools for fine-grained control over each phase:

```typescript
// Create a new run
const runId = crypto.randomUUID();

await client.callTool("indexfoundry_connect_folder", {
  run_id: runId,
  path: "/path/to/documents",
  glob: "**/*.pdf"
});

// Extract PDF content
await client.callTool("indexfoundry_extract_pdf", {
  run_id: runId,
  pdf_path: "raw/<sha256>.pdf",
  mode: "layout"
});

// Chunk the text
await client.callTool("indexfoundry_normalize_chunk", {
  run_id: runId,
  input_paths: ["extracted/<sha256>.pages.jsonl"],
  strategy: "recursive",
  max_chars: 1500,
  overlap_chars: 150
});

// Generate embeddings
await client.callTool("indexfoundry_index_embed", {
  run_id: runId,
  model: {
    provider: "openai",
    model_name: "text-embedding-3-small",
    api_key_env: "OPENAI_API_KEY"
  }
});

// Upsert to vector DB
await client.callTool("indexfoundry_index_upsert", {
  run_id: runId,
  provider: "local",
  connection: { collection: "my_docs" }
});
```

### Option 2: Project-Based Workflow (Deployable RAG)

Create a self-contained, deployable RAG application:

```typescript
// Create a new project
await client.callTool("indexfoundry_project_create", {
  project_id: "my-rag-app",
  name: "My RAG Search",
  description: "Searchable knowledge base for documentation",
  embedding_model: {
    provider: "openai",
    model_name: "text-embedding-3-small",
    api_key_env: "OPENAI_API_KEY"
  },
  chunk_config: {
    strategy: "recursive",
    max_chars: 1500,
    overlap_chars: 150
  }
});

// Add data sources
await client.callTool("indexfoundry_project_add_source", {
  project_id: "my-rag-app",
  url: "https://docs.example.com",
  source_name: "Documentation Site",
  tags: ["docs", "api"]
});

// Build the vector database
await client.callTool("indexfoundry_project_build", {
  project_id: "my-rag-app"
});

// Query the built index
await client.callTool("indexfoundry_project_query", {
  project_id: "my-rag-app",
  query: "How do I configure authentication?",
  mode: "hybrid",
  top_k: 5
});

// Export for deployment
await client.callTool("indexfoundry_project_export", {
  project_id: "my-rag-app",
  server_name: "my-rag-server",
  include_http: true,
  railway_config: true
});
```

After export, the project directory contains a complete deployable repository:

- `Dockerfile` - Container configuration
- `railway.toml` - Railway deployment config
- `src/index.ts` - Generated MCP server with search tools
- `README.md` - Project-specific documentation

Push to GitHub and deploy:

```bash
cd projects/my-rag-app
git init
git add .
git commit -m "Initial RAG application"
git push
# Then connect to Railway and deploy
```

## Tool Overview

### Run-Based Pipeline Tools

#### Connect Phase

- `indexfoundry_connect_url` - Fetch a single URL with domain allowlisting
- `indexfoundry_connect_sitemap` - Crawl a sitemap with URL filtering
- `indexfoundry_connect_folder` - Load local files with glob patterns
- `indexfoundry_connect_pdf` - Fetch a PDF with metadata extraction

#### Extract Phase

- `indexfoundry_extract_pdf` - PDF to text (layout/plain/OCR modes)
- `indexfoundry_extract_html` - HTML to clean text with structure preservation
- `indexfoundry_extract_document` - Generic document extraction (markdown, txt, CSV, JSON)

#### Normalize Phase

- `indexfoundry_normalize_chunk` - Split text into chunks (recursive/paragraph/heading/page/sentence/fixed; see the sketch below)
- `indexfoundry_normalize_enrich` - Add metadata (language detection, regex tags, section classification)
- `indexfoundry_normalize_dedupe` - Remove duplicates (exact/simhash/minhash)
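For intuition, here is a minimal sketch of deterministic chunking with character overlap, using the same `max_chars`/`overlap_chars` parameters as `indexfoundry_normalize_chunk`. It is a simplified fixed-stride stand-in that only illustrates the determinism property, not the server's actual recursive strategy:

```typescript
// Illustrative only: deterministic fixed-stride chunking with character
// overlap. The server's recursive strategy presumably also respects
// structural boundaries; this sketch just shows that the same text and
// parameters always yield the same chunks. Assumes overlapChars < maxChars.
function chunkText(text: string, maxChars = 1500, overlapChars = 150): string[] {
  const chunks: string[] = [];
  const stride = maxChars - overlapChars;
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}
```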
#### Index Phase

- `indexfoundry_index_embed` - Generate embeddings (OpenAI/Cohere/sentence-transformers/local)
- `indexfoundry_index_upsert` - Write to vector DB (Pinecone/Weaviate/Qdrant/Milvus/Chroma/local)
- `indexfoundry_index_build_profile` - Configure retrieval (top_k, hybrid search, reranking)

#### Serve Phase

- `indexfoundry_serve_openapi` - Generate OpenAPI 3.1 specification
- `indexfoundry_serve_start` - Start HTTP search API server
- `indexfoundry_serve_stop` - Stop running API server
- `indexfoundry_serve_status` - Get server status
- `indexfoundry_serve_query` - Query running server directly

#### Run Utilities

- `indexfoundry_run_status` - Get detailed status of a run
- `indexfoundry_run_list` - List all runs with filtering
- `indexfoundry_run_diff` - Compare two runs (config, chunks, timing)
- `indexfoundry_run_cleanup` - Delete old runs with retention policies

### Project-Based Workflow Tools

#### Project Management

- `indexfoundry_project_create` - Create a new project with embedding and chunk config
- `indexfoundry_project_list` - List all projects with optional statistics
- `indexfoundry_project_get` - Get project details, manifest, and sources
- `indexfoundry_project_delete` - Delete a project (requires `confirm: true`)

#### Source Management

- `indexfoundry_project_add_source` - Add data source (url/sitemap/folder/pdf) with tags

#### Build & Query

- `indexfoundry_project_build` - Process all pending sources (fetch, chunk, embed, upsert)
- `indexfoundry_project_query` - Search project's vector database (semantic/keyword/hybrid; a hybrid-scoring sketch follows below)
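This README names the three query modes but does not specify how hybrid mode fuses the two signals. Weighted score fusion, as sketched below, is one common approach; it is an illustration, not the server's actual ranking:

```typescript
// Illustrative only: one common way to combine semantic and keyword
// scores for a "hybrid" mode. The actual fusion rule used by
// indexfoundry_project_query is not specified in this README.
interface ScoredChunk {
  chunk_id: string;
  semanticScore: number; // e.g. cosine similarity, normalized to [0, 1]
  keywordScore: number;  // e.g. BM25, normalized to [0, 1]
}

function hybridRank(chunks: ScoredChunk[], alpha = 0.5, topK = 5): ScoredChunk[] {
  return [...chunks]
    .sort((a, b) =>
      (alpha * b.semanticScore + (1 - alpha) * b.keywordScore) -
      (alpha * a.semanticScore + (1 - alpha) * a.keywordScore))
    .slice(0, topK);
}
```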
#### Deployment

- `indexfoundry_project_export` - Generate deployment files (Dockerfile, MCP server, railway.toml)

## Directory Structures

### Run-Based Structure

```
runs/<run_id>/
├── manifest.json     # Master audit trail
├── config.json       # Frozen config
├── raw/              # Fetched artifacts
├── extracted/        # Text extraction
├── normalized/       # Chunks
├── indexed/          # Embeddings
├── served/           # API artifacts
└── logs/             # Event logs
```

### Project-Based Structure

```
projects/<project_id>/
├── project.json      # Project manifest (embedding config, stats)
├── sources.jsonl     # Source records (url/sitemap/folder/pdf)
├── data/
│   ├── chunks.jsonl  # Indexed chunks
│   └── vectors.jsonl # Generated embeddings
├── runs/             # Per-source build runs
├── src/
│   └── index.ts      # Generated MCP server
├── Dockerfile        # Container configuration
├── railway.toml      # Railway deployment config
├── package.json      # Server dependencies
├── tsconfig.json     # TypeScript config
└── README.md         # Project documentation
```

## Configuration

### Environment Variables

```bash
# Run-based pipeline
INDEXFOUNDRY_RUNS_DIR=./runs   # Where to store runs

# Embeddings
OPENAI_API_KEY=sk-...          # For OpenAI embeddings
EMBEDDING_API_KEY=sk-...       # Generic env variable (configurable per project)

# Server
PORT=3000                      # For HTTP transport
TRANSPORT=stdio                # stdio or http
```

### Project Configuration

Projects store configuration in `project.json`:

```json
{
  "project_id": "my-rag",
  "name": "My RAG Search",
  "embedding_model": {
    "provider": "openai",
    "model_name": "text-embedding-3-small",
    "api_key_env": "OPENAI_API_KEY"
  },
  "chunk_config": {
    "strategy": "recursive",
    "max_chars": 1500,
    "overlap_chars": 150
  }
}
```

## Example Usage

### Run-Based Pipeline Example

```typescript
// Create a new run
const runId = crypto.randomUUID();

// Connect: fetch from a folder
await client.callTool("indexfoundry_connect_folder", {
  run_id: runId,
  path: "/path/to/documents",
  glob: "**/*.pdf"
});

// Extract: PDF to text
await client.callTool("indexfoundry_extract_pdf", {
  run_id: runId,
  pdf_path: "raw/<sha256>.pdf",
  mode: "layout"
});

// Normalize: chunk text
await client.callTool("indexfoundry_normalize_chunk", {
  run_id: runId,
  input_paths: ["extracted/<sha256>.pages.jsonl"],
  strategy: "recursive",
  max_chars: 1500,
  overlap_chars: 150
});

// Index: generate embeddings
await client.callTool("indexfoundry_index_embed", {
  run_id: runId,
  model: {
    provider: "openai",
    model_name: "text-embedding-3-small",
    api_key_env: "OPENAI_API_KEY"
  }
});

// Upsert to the local vector DB
await client.callTool("indexfoundry_index_upsert", {
  run_id: runId,
  provider: "local",
  connection: { collection: "my_docs" }
});

// Serve: start the HTTP API
await client.callTool("indexfoundry_serve_start", {
  run_id: runId,
  port: 8080
});
```

### Project-Based Workflow Example

```typescript
// Create a deployable RAG project
await client.callTool("indexfoundry_project_create", {
  project_id: "my-docs-rag",
  name: "Company Documentation Search",
  description: "Searchable knowledge base for internal docs",
  embedding_model: {
    provider: "openai",
    model_name: "text-embedding-3-small",
    api_key_env: "OPENAI_API_KEY"
  },
  chunk_config: {
    strategy: "recursive",
    max_chars: 1500,
    overlap_chars: 150
  }
});

// Add multiple sources
await client.callTool("indexfoundry_project_add_source", {
  project_id: "my-docs-rag",
  url: "https://docs.company.com",
  source_name: "Main Docs",
  tags: ["docs", "internal"]
});

await client.callTool("indexfoundry_project_add_source", {
  project_id: "my-docs-rag",
  folder_path: "/path/to/pdfs",
  source_name: "Policy Documents",
  tags: ["policy", "pdf"]
});

// Build the vector database
await client.callTool("indexfoundry_project_build", {
  project_id: "my-docs-rag"
});

// Query the index
const results = await client.callTool("indexfoundry_project_query", {
  project_id: "my-docs-rag",
  query: "What is the vacation policy?",
  mode: "hybrid",
  top_k: 5,
  filter_tags: ["policy"]
});

// Export for deployment
await client.callTool("indexfoundry_project_export", {
  project_id: "my-docs-rag",
  server_name: "docs-search-server",
  server_description: "Internal documentation search API",
  include_http: true,
  railway_config: true
});
```

After export, the project directory contains a deployable repository:

```bash
cd projects/my-docs-rag
git init
git add .
git commit -m "Initial RAG application"
git push origin main
# Deploy on Railway
```

## Development

```bash
# Development with watch mode
npm run dev

# Run tests
npm test

# Lint
npm run lint

# Test with the MCP Inspector
npm run inspector
```

## Testing

The MCP server has been validated with end-to-end testing:

- ✅ Project creation, listing, and retrieval
- ✅ Source addition (URL, folder, PDF, sitemap)
- ✅ Build pipeline (fetch → chunk → embed → upsert)
- ✅ Vector search with semantic, keyword, and hybrid modes
- ✅ Deployment file generation (Dockerfile, railway.toml, MCP server)

## Deployment

### Railway Deployment

1. Create and export a project:

   ```typescript
   await client.callTool("indexfoundry_project_export", {
     project_id: "my-rag",
     railway_config: true
   });
   ```

2. Push to GitHub and connect the repository to Railway
3. Railway automatically detects `railway.toml` and deploys

### Docker Deployment

```bash
cd projects/my-rag
docker build -t my-rag-server .
docker run -p 8080:8080 -e OPENAI_API_KEY=sk-... my-rag-server
```

## Determinism Guarantees

1. **Sorted inputs**: File lists are sorted before processing
2. **Stable IDs**: Chunk IDs derived from content + position (see the sketch below)
3. **Content hashes**: SHA256 on every artifact
4. **Pinned versions**: Extractor versions locked in config
5. **No randomness**: No sampling, shuffling, or non-deterministic algorithms
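For example, guarantee 2 implies a derivation along these lines. The `chunkId` helper is hypothetical; the exact scheme IndexFoundry uses is not specified in this README:

```typescript
// Hypothetical illustration of guarantee 2: an ID derived purely from
// content and position is reproducible across re-runs, so re-indexing
// unchanged sources upserts the same records.
import { createHash } from "node:crypto";

function chunkId(sourceSha256: string, chunkIndex: number, chunkText: string): string {
  return createHash("sha256")
    .update(`${sourceSha256}:${chunkIndex}:${chunkText}`)
    .digest("hex");
}
```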

## License

MIT