index_document

Adds documents to Elasticsearch with duplicate detection and automatic ID generation, keeping knowledge base content organized.

Instructions

Index a document into Elasticsearch with smart duplicate prevention and intelligent document ID generation. šŸ’” RECOMMENDED: Use the 'create_document_template' tool first to generate a proper document structure and avoid validation errors.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| index | Yes | Name of the Elasticsearch index to store the document | |
| document | Yes | Document data to index, as a JSON object. šŸ’” RECOMMENDED: Use the 'create_document_template' tool first to generate the proper document format. | |
| doc_id | No | Optional document ID; if not provided, a smart ID is generated | None |
| validate_schema | No | Whether to validate the document structure against the knowledge base format | True |
| check_duplicates | No | Check for existing documents with a similar title before indexing | True |
| force_index | No | Force indexing even if potential duplicates are found. šŸ’” TIP: Set to True if the content is genuinely new and not already in the knowledge base, to avoid multiple tool calls | False |
| use_ai_similarity | No | Use AI to analyze content similarity and provide intelligent recommendations | True |
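A typical call passes the index name plus a templated document. The arguments below are a hypothetical example: the index name and all field values are illustrative, and the allowed priority/source_type values depend on the schema in your config.json.

```json
{
  "index": "knowledge_base",
  "document": {
    "id": "elasticsearch-indexing-guide",
    "title": "Elasticsearch Indexing Guide",
    "summary": "How documents are indexed into the knowledge base.",
    "content": "Documents are indexed via the index_document tool after being shaped with create_document_template...",
    "priority": "high",
    "source_type": "markdown",
    "tags": ["elasticsearch", "indexing"],
    "key_points": ["Use create_document_template first to avoid validation errors"],
    "related": [],
    "last_modified": "2025-01-04T10:30:00Z"
  },
  "check_duplicates": true,
  "use_ai_similarity": true
}
```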

Implementation Reference

  • Core implementation of the index_document tool. Handles Elasticsearch indexing with advanced features: smart duplicate detection via title matching, AI content similarity analysis, optional schema validation, auto-generated document IDs with collision avoidance, force indexing option, and comprehensive user-friendly error messages with actionable suggestions.
```python
@app.tool(
    description="Index a document into Elasticsearch with smart duplicate prevention and intelligent document ID generation. šŸ’” RECOMMENDED: Use 'create_document_template' tool first to generate a proper document structure and avoid validation errors.",
    tags={"elasticsearch", "index", "document", "validation", "duplicate-prevention"}
)
async def index_document(
    index: Annotated[str, Field(description="Name of the Elasticsearch index to store the document")],
    document: Annotated[Dict[str, Any], Field(description="Document data to index as JSON object. šŸ’” RECOMMENDED: Use 'create_document_template' tool first to generate proper document format.")],
    doc_id: Annotated[Optional[str], Field(
        description="Optional document ID - if not provided, smart ID will be generated")] = None,
    validate_schema: Annotated[
        bool, Field(description="Whether to validate document structure for knowledge base format")] = True,
    check_duplicates: Annotated[
        bool, Field(description="Check for existing documents with similar title before indexing")] = True,
    force_index: Annotated[
        bool, Field(description="Force indexing even if potential duplicates are found. šŸ’” TIP: Set to True if content is genuinely new and not in knowledge base to avoid multiple tool calls")] = False,
    use_ai_similarity: Annotated[bool, Field(
        description="Use AI to analyze content similarity and provide intelligent recommendations")] = True,
    ctx: Context = None
) -> str:
    """Index a document into Elasticsearch with smart duplicate prevention."""
    try:
        es = get_es_client()

        # Smart duplicate checking if enabled
        if check_duplicates and not force_index:
            title = document.get('title', '')
            content = document.get('content', '')
            if title:
                # First check simple title duplicates
                dup_check = check_title_duplicates(es, index, title)
                if dup_check['found']:
                    duplicates_info = "\n".join([
                        f"   šŸ“„ {dup['title']} (ID: {dup['id']})\n      šŸ“ {dup['summary']}\n      šŸ“… {dup['last_modified']}"
                        for dup in dup_check['duplicates'][:3]
                    ])

                    # Use AI similarity analysis if enabled and content is substantial
                    if use_ai_similarity and content and len(content) > 200 and ctx:
                        try:
                            ai_analysis = await check_content_similarity_with_ai(es, index, title, content, ctx)
                            action = ai_analysis.get('action', 'CREATE')
                            confidence = ai_analysis.get('confidence', 0.5)
                            reasoning = ai_analysis.get('reasoning', 'AI analysis completed')
                            target_doc = ai_analysis.get('target_document_id', '')

                            ai_message = f"\n\nšŸ¤– **AI Content Analysis** (Confidence: {confidence:.0%}):\n"
                            ai_message += f"   šŸŽÆ **Recommended Action**: {action}\n"
                            ai_message += f"   šŸ’­ **AI Reasoning**: {reasoning}\n"

                            if action == "UPDATE" and target_doc:
                                ai_message += f"   šŸ“„ **Target Document**: {target_doc}\n"
                                ai_message += f"   šŸ’” **Suggestion**: Update existing document instead of creating new one\n"
                            elif action == "DELETE":
                                ai_message += f"   šŸ—‘ļø **AI Recommendation**: Existing content is superior, consider not creating this document\n"
                            elif action == "MERGE" and target_doc:
                                ai_message += f"   šŸ”„ **Merge Target**: {target_doc}\n"
                                ai_message += f"   šŸ“ **Strategy**: {ai_analysis.get('merge_strategy', 'Combine unique information from both documents')}\n"
                            elif action == "CREATE":
                                ai_message += f"   āœ… **AI Approval**: Content is sufficiently unique to create new document\n"
                                # If AI says CREATE, allow automatic indexing
                                pass

                            # Show similar documents found by AI
                            similar_docs = ai_analysis.get('similar_docs', [])
                            if similar_docs:
                                ai_message += f"\n   šŸ“‹ **Similar Documents Analyzed**:\n"
                                for i, doc in enumerate(similar_docs[:2], 1):
                                    ai_message += f"      {i}. {doc['title']} (Score: {doc.get('elasticsearch_score', 0):.1f})\n"

                            # If AI recommends CREATE with high confidence, proceed automatically
                            if action == "CREATE" and confidence > 0.8:
                                # Continue with indexing - don't return early
                                pass
                            else:
                                # Return AI analysis for user review
                                return (
                                    f"āš ļø **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n"
                                    + f"{duplicates_info}\n"
                                    + f"{ai_message}\n\n"
                                    + f"šŸ¤” **What would you like to do?**\n"
                                    + f"   1ļøāƒ£ **FOLLOW AI RECOMMENDATION**: {action} as suggested by AI\n"
                                    + f"   2ļøāƒ£ **UPDATE existing document**: Modify one of the above instead\n"
                                    + f"   3ļøāƒ£ **SEARCH for more**: Use search tool to find all related content\n"
                                    + f"   4ļøāƒ£ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n"
                                    + f"šŸ’” **AI Recommendation**: {reasoning}\n"
                                    + f"šŸ” **Next Step**: Search for '{title}' to see all related documents\n\n"
                                    + f"⚔ **To force indexing**: Call again with force_index=True")
                        except Exception as ai_error:
                            # Fallback to simple duplicate check if AI fails
                            return (
                                f"āš ļø **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n"
                                + f"{duplicates_info}\n\n"
                                + f"āš ļø **AI Analysis Failed**: {str(ai_error)}\n\n"
                                + f"šŸ¤” **What would you like to do?**\n"
                                + f"   1ļøāƒ£ **UPDATE existing document**: Modify one of the above instead\n"
                                + f"   2ļøāƒ£ **SEARCH for more**: Use search tool to find all related content\n"
                                + f"   3ļøāƒ£ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n"
                                + f"šŸ’” **Recommendation**: Update existing documents to prevent knowledge base bloat\n"
                                + f"šŸ” **Next Step**: Search for '{title}' to see all related documents\n\n"
                                + f"⚔ **To force indexing**: Call again with force_index=True")
                    else:
                        # Simple duplicate check without AI
                        return (
                            f"āš ļø **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n"
                            + f"{duplicates_info}\n\n"
                            + f"šŸ¤” **What would you like to do?**\n"
                            + f"   1ļøāƒ£ **UPDATE existing document**: Modify one of the above instead\n"
                            + f"   2ļøāƒ£ **SEARCH for more**: Use search tool to find all related content\n"
                            + f"   3ļøāƒ£ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n"
                            + f"šŸ’” **Recommendation**: Update existing documents to prevent knowledge base bloat\n"
                            + f"šŸ” **Next Step**: Search for '{title}' to see all related documents\n\n"
                            + f"⚔ **To force indexing**: Call again with force_index=True")

        # Generate smart document ID if not provided
        if not doc_id:
            existing_ids = get_existing_document_ids(es, index)
            doc_id = generate_smart_doc_id(
                document.get('title', 'untitled'),
                document.get('content', ''),
                existing_ids
            )
            document['id'] = doc_id  # Ensure document has the ID

        # Validate document structure if requested
        if validate_schema:
            try:
                # Check if this looks like a knowledge base document
                if isinstance(document, dict) and "id" in document and "title" in document:
                    validated_doc = validate_document_structure(document)
                    document = validated_doc
                    # Use the document ID from the validated document if not provided earlier
                    if not doc_id:
                        doc_id = document.get("id")
                else:
                    # For non-knowledge base documents, still validate with strict mode if enabled
                    validated_doc = validate_document_structure(document, is_knowledge_doc=False)
                    document = validated_doc
            except DocumentValidationError as e:
                return f"āŒ Validation failed:\n\n{format_validation_error(e)}"
            except Exception as e:
                return f"āŒ Validation error: {str(e)}"

        # Index the document
        result = es.index(index=index, id=doc_id, body=document)
        success_message = f"āœ… Document indexed successfully:\n\n{json.dumps(result, indent=2, ensure_ascii=False)}"

        # Add smart guidance based on indexing result
        if result.get('result') == 'created':
            success_message += f"\n\nšŸŽ‰ **New Document Created**:\n"
            success_message += f"   šŸ“„ **Document ID**: {doc_id}\n"
            success_message += f"   šŸ†” **ID Strategy**: {'User-provided' if 'doc_id' in locals() and doc_id else 'Smart-generated'}\n"
            if check_duplicates:
                success_message += f"   āœ… **Duplicate Check**: Passed - no similar titles found\n"
        else:
            success_message += f"\n\nšŸ”„ **Document Updated**:\n"
            success_message += f"   šŸ“„ **Document ID**: {doc_id}\n"
            success_message += f"   ⚔ **Action**: Replaced existing document with same ID\n"

        success_message += (
            f"\n\nšŸ’” **Smart Duplicate Prevention Active**:\n"
            + f"   šŸ” **Auto-Check**: {'Enabled' if check_duplicates else 'Disabled'} - searches for similar titles\n"
            + f"   šŸ¤– **AI Analysis**: {'Enabled' if use_ai_similarity else 'Disabled'} - intelligent content similarity detection\n"
            + f"   šŸ†” **Smart IDs**: Auto-generated from title with collision detection\n"
            + f"   ⚔ **Force Option**: Use force_index=True to bypass duplicate warnings\n"
            + f"   šŸ”„ **Update Recommended**: Modify existing documents instead of creating duplicates\n\n"
            + f"šŸ¤ **Best Practices**:\n"
            + f"   • Search before creating: 'search(index=\"{index}\", query=\"your topic\")'\n"
            + f"   • Update existing documents when possible\n"
            + f"   • Use descriptive titles for better smart ID generation\n"
            + f"   • AI will analyze content similarity for intelligent recommendations\n"
            + f"   • Set force_index=True only when content is truly unique")

        return success_message

    except Exception as e:
        # Provide detailed error messages for different types of Elasticsearch errors
        error_message = "āŒ Document indexing failed:\n\n"
        error_str = str(e).lower()

        if "connection" in error_str or "refused" in error_str:
            error_message += "šŸ”Œ **Connection Error**: Cannot connect to Elasticsearch server\n"
            error_message += f"šŸ“ Check if Elasticsearch is running at the configured address\n"
            error_message += f"šŸ’” Try: Use 'setup_elasticsearch' tool to start Elasticsearch\n\n"
        elif ("index" in error_str and "not found" in error_str) or "index_not_found_exception" in error_str:
            error_message += f"šŸ“ **Index Error**: Index '{index}' does not exist\n"
            error_message += f"šŸ“ The target index has not been created yet\n"
            error_message += f"šŸ’” **Suggestions for agents**:\n"
            error_message += f"   1. Use 'create_index' tool to create the index first\n"
            error_message += f"   2. Use 'list_indices' to see available indices\n"
            error_message += f"   3. Check the correct index name for your data type\n\n"
        elif "mapping" in error_str or "field" in error_str:
            error_message += f"šŸ—‚ļø **Mapping Error**: Document structure conflicts with index mapping\n"
            error_message += f"šŸ“ Document fields don't match the expected index schema\n"
            error_message += f"šŸ’” Try: Adjust document structure or update index mapping\n\n"
        elif "version" in error_str or "conflict" in error_str:
            error_message += f"⚔ **Version Conflict**: Document already exists with different version\n"
            error_message += f"šŸ“ Another process modified this document simultaneously\n"
            error_message += f"šŸ’” Try: Use 'get_document' first, then update with latest version\n\n"
        elif "timeout" in error_str:
            error_message += "ā±ļø **Timeout Error**: Indexing operation timed out\n"
            error_message += f"šŸ“ Document may be too large or index overloaded\n"
            error_message += f"šŸ’” Try: Reduce document size or retry later\n\n"
        else:
            error_message += f"āš ļø **Unknown Error**: {str(e)}\n\n"

        error_message += f"šŸ” **Technical Details**: {str(e)}"
        return error_message
```
  • Document schema validation function validate_document_structure, which validates against a schema loaded from config.json. It enforces required fields, data types, priority/source_type enums, ID format, timestamp format, non-empty lists/strings, and strict-mode options. Called by index_document when validate_schema=True.
```python
def validate_document_structure(document: Dict[str, Any], base_directory: str = None, is_knowledge_doc: bool = True) -> Dict[str, Any]:
    """
    Validate document structure against schema with strict mode support.

    Args:
        document: Document to validate
        base_directory: Base directory for relative path conversion
        is_knowledge_doc: Whether this is a knowledge base document (default: True)

    Returns:
        Validated and normalized document

    Raises:
        DocumentValidationError: If validation fails
    """
    errors = []
    validation_config = load_validation_config()
    document_schema = load_document_schema()

    # For knowledge base documents, check the full schema
    if is_knowledge_doc:
        # Check for extra fields if strict validation is enabled
        if validation_config.get("strict_schema_validation", False) and not validation_config.get("allow_extra_fields", True):
            allowed_fields = set(document_schema["required_fields"])
            document_fields = set(document.keys())
            extra_fields = document_fields - allowed_fields
            if extra_fields:
                errors.append(f"Extra fields not allowed in strict mode: {', '.join(sorted(extra_fields))}. Allowed fields: {', '.join(sorted(allowed_fields))}")
    else:
        # For non-knowledge documents, only check for extra fields if strict validation is enabled
        if validation_config.get("strict_schema_validation", False) and not validation_config.get("allow_extra_fields", True):
            # For non-knowledge docs, we don't have a predefined schema, so just enforce no extra fields beyond basic ones
            # This is a more lenient check - you might want to customize this based on your needs
            errors.append("Strict schema validation is enabled. Extra fields are not allowed for custom documents.")

    # Check required fields only for knowledge base documents
    if is_knowledge_doc:
        required_fields = document_schema["required_fields"]
        if validation_config.get("required_fields_only", False):
            # Only check fields that are actually required
            for field in required_fields:
                if field not in document:
                    errors.append(f"Missing required field: {field}")
        else:
            # Check all fields in schema
            for field in required_fields:
                if field not in document:
                    errors.append(f"Missing required field: {field}")

    if errors:
        raise DocumentValidationError("Validation failed: " + "; ".join(errors))

    # For knowledge base documents, perform detailed validation
    if is_knowledge_doc:
        # Validate field types
        for field, expected_type in document_schema["field_types"].items():
            if field in document:
                if not isinstance(document[field], expected_type):
                    errors.append(f"Field '{field}' must be of type {expected_type.__name__}, got {type(document[field]).__name__}")

        # NEW: Validate content length
        if document.get("content"):
            content = document["content"]
            # Check for empty content
            if not content.strip():
                errors.append("Content cannot be empty or contain only whitespace")

        # Validate priority values
        if document.get("priority") not in document_schema["priority_values"]:
            errors.append(f"Priority must be one of {document_schema['priority_values']}, got '{document.get('priority')}'")

        # Validate source_type
        if document.get("source_type") not in document_schema["source_types"]:
            errors.append(f"Source type must be one of {document_schema['source_types']}, got '{document.get('source_type')}'")

        # Validate ID format (should be alphanumeric with hyphens)
        if document.get("id") and not re.match(r'^[a-zA-Z0-9-_]+$', document["id"]):
            errors.append("ID must contain only alphanumeric characters, hyphens, and underscores")

        # Validate timestamp format
        if document.get("last_modified"):
            try:
                datetime.fromisoformat(document["last_modified"].replace('Z', '+00:00'))
            except ValueError:
                errors.append("last_modified must be in ISO 8601 format (e.g., '2025-01-04T10:30:00Z')")

        # Validate tags (must be non-empty strings)
        if document.get("tags"):
            for i, tag in enumerate(document["tags"]):
                if not isinstance(tag, str) or not tag.strip():
                    errors.append(f"Tag at index {i} must be a non-empty string")

        # Validate related documents (must be strings)
        if document.get("related"):
            for i, related_id in enumerate(document["related"]):
                if not isinstance(related_id, str) or not related_id.strip():
                    errors.append(f"Related document ID at index {i} must be a non-empty string")

        # Validate key_points (must be non-empty strings)
        if document.get("key_points"):
            for i, point in enumerate(document["key_points"]):
                if not isinstance(point, str) or not point.strip():
                    errors.append(f"Key point at index {i} must be a non-empty string")

    if errors:
        raise DocumentValidationError("Validation failed: " + "; ".join(errors))

    return document
```
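The settings this function reads via load_validation_config() live in config.json. A minimal sketch of how they might look; the wrapping "document_validation" section name and layout are assumptions, and only the three keys are taken from the code above:

```json
{
  "document_validation": {
    "strict_schema_validation": false,
    "allow_extra_fields": true,
    "required_fields_only": false
  }
}
```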
  • Mounting of the elasticsearch_document sub-server app (containing index_document) into the unified elasticsearch_server app at line 48: app.mount(document_app). Makes the tool available in the elasticsearch namespace.
```python
from .sub_servers.elasticsearch_document import app as document_app
from .sub_servers.elasticsearch_index import app as index_app
from .sub_servers.elasticsearch_search import app as search_app
from .sub_servers.elasticsearch_batch import app as batch_app

# Create unified FastMCP application
app = FastMCP(
    name="AgentKnowledgeMCP-Elasticsearch",
    version="2.0.0",
    instructions="Unified Elasticsearch tools for comprehensive knowledge management via modular server mounting"
)

# ================================
# SERVER MOUNTING - MODULAR ARCHITECTURE
# ================================
print("šŸ—ļø Mounting Elasticsearch sub-servers...")

# Mount all sub-servers into unified interface
app.mount(snapshots_app)       # 3 tools: snapshot management
app.mount(index_metadata_app)  # 3 tools: metadata governance
app.mount(document_app)        # 3 tools: document operations
```
  • Mounting of the elasticsearch_server app (including index_document) into the main FastMCP server app: app.mount(elasticsearch_server_app). Exposes the tool globally without prefix for backward compatibility.
```python
app.mount(elasticsearch_server_app)
```
  • AI-powered duplicate/content similarity checker check_content_similarity_with_ai used by index_document. Performs Elasticsearch similarity search followed by LLM analysis to recommend CREATE/UPDATE/DELETE/MERGE actions with confidence scores and reasoning.
```python
async def check_content_similarity_with_ai(es, index: str, title: str, content: str, ctx: Context, similarity_threshold: float = 0.7) -> dict:
    """
    Advanced content similarity checking using AI analysis.
    Returns recommendations for UPDATE, DELETE, CREATE, or MERGE actions.
    """
    try:
        # First, find potentially similar documents using Elasticsearch
        similar_docs = []

        # Search for documents with similar titles or content
        if len(content) > 100:
            search_query = {
                "query": {
                    "bool": {
                        "should": [
                            {"match": {"title": {"query": title, "boost": 3.0}}},
                            {"match": {"content": {"query": content[:500], "boost": 1.0}}},
                            {"more_like_this": {
                                "fields": ["content", "title"],
                                "like": content[:1000],
                                "min_term_freq": 1,
                                "max_query_terms": 8,
                                "minimum_should_match": "30%"
                            }}
                        ]
                    }
                },
                "size": 5,
                "_source": ["title", "summary", "content", "last_modified", "id"]
            }

            result = es.search(index=index, body=search_query)

            # Collect similar documents
            for hit in result['hits']['hits']:
                source = hit['_source']
                similar_docs.append({
                    "id": hit['_id'],
                    "title": source.get('title', ''),
                    "summary": source.get('summary', '')[:200] + "..." if len(source.get('summary', '')) > 200 else source.get('summary', ''),
                    "content_preview": source.get('content', '')[:300] + "..." if len(source.get('content', '')) > 300 else source.get('content', ''),
                    "last_modified": source.get('last_modified', ''),
                    "elasticsearch_score": hit['_score']
                })

        # If no similar documents found, recommend CREATE
        if not similar_docs:
            return {
                "action": "CREATE",
                "confidence": 0.95,
                "reason": "No similar content found in knowledge base",
                "similar_docs": [],
                "ai_analysis": "Content appears to be unique and should be created as new document"
            }

        # Use AI to analyze content similarity and recommend action
        ai_prompt = f"""You are an intelligent duplicate detection system. Analyze the new document against existing similar documents and recommend the best action.

NEW DOCUMENT:
Title: {title}
Content: {content[:1500]}{"..." if len(content) > 1500 else ""}

EXISTING SIMILAR DOCUMENTS:
"""
        for i, doc in enumerate(similar_docs[:3], 1):
            ai_prompt += f"""
Document {i}: {doc['title']} (ID: {doc['id']})
Summary: {doc['summary']}
Content Preview: {doc['content_preview']}
Last Modified: {doc['last_modified']}
---"""

        ai_prompt += f"""

Please analyze and provide:
1. Content similarity percentage (0-100%) for each existing document
2. Recommended action: UPDATE, DELETE, CREATE, or MERGE
3. Detailed reasoning for your recommendation
4. Which specific document to update/merge with (if applicable)

Guidelines:
- UPDATE: If new content is an improved version of existing content (>70% similar)
- DELETE: If existing content is clearly superior and new content adds no value (>85% similar)
- MERGE: If both contents have valuable unique information (40-70% similar)
- CREATE: If content is sufficiently different and valuable (<40% similar)

Respond in JSON format:
{{
    "similarity_scores": [85, 60, 20],
    "recommended_action": "UPDATE|DELETE|CREATE|MERGE",
    "confidence": 0.85,
    "target_document_id": "doc-id-if-update-or-merge",
    "reasoning": "Detailed explanation of why this action is recommended",
    "merge_strategy": "How to combine documents if MERGE is recommended"
}}

Consider:
- Content quality and completeness
- Information uniqueness and value
- Documentation freshness and accuracy
- Knowledge base organization"""

        # Get AI analysis
        response = await ctx.sample(
            messages=ai_prompt,
            system_prompt="You are an expert knowledge management AI. Analyze content similarity and recommend the optimal action to maintain a high-quality, organized knowledge base. Always respond with valid JSON.",
            model_preferences=["claude-3-opus", "claude-3-sonnet", "gpt-4"],
            temperature=0.3,
            max_tokens=600
        )

        # Parse AI response
        ai_analysis = json.loads(response.text.strip())

        # Add similar documents to response
        ai_analysis["similar_docs"] = similar_docs
        ai_analysis["ai_analysis"] = response.text

        return ai_analysis

    except Exception as e:
        # Fallback to simple duplicate check if AI analysis fails
        return {
            "action": "CREATE",
            "confidence": 0.6,
            "reason": f"AI analysis failed ({str(e)}), defaulting to CREATE",
            "similar_docs": similar_docs if 'similar_docs' in locals() else [],
            "ai_analysis": f"Error during AI analysis: {str(e)}"
        }
```
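For reference, index_document consumes the returned dictionary through ai_analysis.get('action'), .get('confidence'), .get('reasoning'), .get('target_document_id'), .get('merge_strategy'), and .get('similar_docs'). A sketch of a plausible result is shown below; all values are illustrative, not real output:

```json
{
  "action": "UPDATE",
  "confidence": 0.85,
  "reasoning": "The new document largely overlaps an existing guide but adds newer configuration steps.",
  "target_document_id": "elasticsearch-indexing-guide",
  "similar_docs": [
    {
      "id": "elasticsearch-indexing-guide",
      "title": "Elasticsearch Indexing Guide",
      "summary": "How documents are indexed into the knowledge base...",
      "content_preview": "Documents are indexed via the index_document tool...",
      "last_modified": "2025-01-04T10:30:00Z",
      "elasticsearch_score": 12.4
    }
  ],
  "ai_analysis": "<raw model response text>"
}
```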

MCP directory API

We provide all information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/itshare4u/AgentKnowledgeMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.