index_document

Adds documents to Elasticsearch with duplicate detection and automatic ID generation, keeping knowledge base content organized.

Instructions

Index a document into Elasticsearch with smart duplicate prevention and intelligent document ID generation. šŸ’” RECOMMENDED: Use the 'create_document_template' tool first to generate a proper document structure and avoid validation errors.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| index | Yes | Name of the Elasticsearch index to store the document | |
| document | Yes | Document data to index, as a JSON object. šŸ’” RECOMMENDED: Use the 'create_document_template' tool first to generate the proper document format. | |
| doc_id | No | Optional document ID; if not provided, a smart ID is generated | None |
| validate_schema | No | Whether to validate the document structure against the knowledge base format | True |
| check_duplicates | No | Check for existing documents with a similar title before indexing | True |
| force_index | No | Force indexing even if potential duplicates are found. šŸ’” TIP: Set to True if the content is genuinely new and not already in the knowledge base, to avoid multiple tool calls | False |
| use_ai_similarity | No | Use AI to analyze content similarity and provide intelligent recommendations | True |
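A typical call passes the index name plus a templated document. The arguments below are a hypothetical example: the index name and all field values are illustrative, and the allowed priority/source_type values depend on the schema in your config.json.

```json
{
  "index": "knowledge_base",
  "document": {
    "id": "elasticsearch-indexing-guide",
    "title": "Elasticsearch Indexing Guide",
    "summary": "How documents are indexed into the knowledge base.",
    "content": "Documents are indexed via the index_document tool after being shaped with create_document_template...",
    "priority": "high",
    "source_type": "markdown",
    "tags": ["elasticsearch", "indexing"],
    "key_points": ["Use create_document_template first to avoid validation errors"],
    "related": [],
    "last_modified": "2025-01-04T10:30:00Z"
  },
  "check_duplicates": true,
  "use_ai_similarity": true
}
```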

Implementation Reference

  • Core implementation of the index_document tool. Handles Elasticsearch indexing with advanced features: smart duplicate detection via title matching, AI content similarity analysis, optional schema validation, auto-generated document IDs with collision avoidance, force indexing option, and comprehensive user-friendly error messages with actionable suggestions.
```python
@app.tool(
    description="Index a document into Elasticsearch with smart duplicate prevention and intelligent document ID generation. šŸ’” RECOMMENDED: Use 'create_document_template' tool first to generate a proper document structure and avoid validation errors.",
    tags={"elasticsearch", "index", "document", "validation", "duplicate-prevention"}
)
async def index_document(
    index: Annotated[str, Field(description="Name of the Elasticsearch index to store the document")],
    document: Annotated[Dict[str, Any], Field(description="Document data to index as JSON object. šŸ’” RECOMMENDED: Use 'create_document_template' tool first to generate proper document format.")],
    doc_id: Annotated[Optional[str], Field(
        description="Optional document ID - if not provided, smart ID will be generated")] = None,
    validate_schema: Annotated[
        bool, Field(description="Whether to validate document structure for knowledge base format")] = True,
    check_duplicates: Annotated[
        bool, Field(description="Check for existing documents with similar title before indexing")] = True,
    force_index: Annotated[
        bool, Field(description="Force indexing even if potential duplicates are found. šŸ’” TIP: Set to True if content is genuinely new and not in knowledge base to avoid multiple tool calls")] = False,
    use_ai_similarity: Annotated[bool, Field(
        description="Use AI to analyze content similarity and provide intelligent recommendations")] = True,
    ctx: Context = None
) -> str:
    """Index a document into Elasticsearch with smart duplicate prevention."""
    try:
        es = get_es_client()

        # Smart duplicate checking if enabled
        if check_duplicates and not force_index:
            title = document.get('title', '')
            content = document.get('content', '')
            if title:
                # First check simple title duplicates
                dup_check = check_title_duplicates(es, index, title)
                if dup_check['found']:
                    duplicates_info = "\n".join([
                        f"   šŸ“„ {dup['title']} (ID: {dup['id']})\n      šŸ“ {dup['summary']}\n      šŸ“… {dup['last_modified']}"
                        for dup in dup_check['duplicates'][:3]
                    ])

                    # Use AI similarity analysis if enabled and content is substantial
                    if use_ai_similarity and content and len(content) > 200 and ctx:
                        try:
                            ai_analysis = await check_content_similarity_with_ai(es, index, title, content, ctx)
                            action = ai_analysis.get('action', 'CREATE')
                            confidence = ai_analysis.get('confidence', 0.5)
                            reasoning = ai_analysis.get('reasoning', 'AI analysis completed')
                            target_doc = ai_analysis.get('target_document_id', '')

                            ai_message = f"\n\nšŸ¤– **AI Content Analysis** (Confidence: {confidence:.0%}):\n"
                            ai_message += f"   šŸŽÆ **Recommended Action**: {action}\n"
                            ai_message += f"   šŸ’­ **AI Reasoning**: {reasoning}\n"

                            if action == "UPDATE" and target_doc:
                                ai_message += f"   šŸ“„ **Target Document**: {target_doc}\n"
                                ai_message += f"   šŸ’” **Suggestion**: Update existing document instead of creating new one\n"
                            elif action == "DELETE":
                                ai_message += f"   šŸ—‘ļø **AI Recommendation**: Existing content is superior, consider not creating this document\n"
                            elif action == "MERGE" and target_doc:
                                ai_message += f"   šŸ”„ **Merge Target**: {target_doc}\n"
                                ai_message += f"   šŸ“ **Strategy**: {ai_analysis.get('merge_strategy', 'Combine unique information from both documents')}\n"
                            elif action == "CREATE":
                                ai_message += f"   āœ… **AI Approval**: Content is sufficiently unique to create new document\n"
                                # If AI says CREATE, allow automatic indexing
                                pass

                            # Show similar documents found by AI
                            similar_docs = ai_analysis.get('similar_docs', [])
                            if similar_docs:
                                ai_message += f"\n   šŸ“‹ **Similar Documents Analyzed**:\n"
                                for i, doc in enumerate(similar_docs[:2], 1):
                                    ai_message += f"      {i}. {doc['title']} (Score: {doc.get('elasticsearch_score', 0):.1f})\n"

                            # If AI recommends CREATE with high confidence, proceed automatically
                            if action == "CREATE" and confidence > 0.8:
                                # Continue with indexing - don't return early
                                pass
                            else:
                                # Return AI analysis for user review
                                return (
                                    f"āš ļø **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n"
                                    + f"{duplicates_info}\n"
                                    + f"{ai_message}\n\n"
                                    + f"šŸ¤” **What would you like to do?**\n"
                                    + f"   1ļøāƒ£ **FOLLOW AI RECOMMENDATION**: {action} as suggested by AI\n"
                                    + f"   2ļøāƒ£ **UPDATE existing document**: Modify one of the above instead\n"
                                    + f"   3ļøāƒ£ **SEARCH for more**: Use search tool to find all related content\n"
                                    + f"   4ļøāƒ£ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n"
                                    + f"šŸ’” **AI Recommendation**: {reasoning}\n"
                                    + f"šŸ” **Next Step**: Search for '{title}' to see all related documents\n\n"
                                    + f"⚔ **To force indexing**: Call again with force_index=True")
                        except Exception as ai_error:
                            # Fallback to simple duplicate check if AI fails
                            return (
                                f"āš ļø **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n"
                                + f"{duplicates_info}\n\n"
                                + f"āš ļø **AI Analysis Failed**: {str(ai_error)}\n\n"
                                + f"šŸ¤” **What would you like to do?**\n"
                                + f"   1ļøāƒ£ **UPDATE existing document**: Modify one of the above instead\n"
                                + f"   2ļøāƒ£ **SEARCH for more**: Use search tool to find all related content\n"
                                + f"   3ļøāƒ£ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n"
                                + f"šŸ’” **Recommendation**: Update existing documents to prevent knowledge base bloat\n"
                                + f"šŸ” **Next Step**: Search for '{title}' to see all related documents\n\n"
                                + f"⚔ **To force indexing**: Call again with force_index=True")
                    else:
                        # Simple duplicate check without AI
                        return (
                            f"āš ļø **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n"
                            + f"{duplicates_info}\n\n"
                            + f"šŸ¤” **What would you like to do?**\n"
                            + f"   1ļøāƒ£ **UPDATE existing document**: Modify one of the above instead\n"
                            + f"   2ļøāƒ£ **SEARCH for more**: Use search tool to find all related content\n"
                            + f"   3ļøāƒ£ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n"
                            + f"šŸ’” **Recommendation**: Update existing documents to prevent knowledge base bloat\n"
                            + f"šŸ” **Next Step**: Search for '{title}' to see all related documents\n\n"
                            + f"⚔ **To force indexing**: Call again with force_index=True")

        # Generate smart document ID if not provided
        if not doc_id:
            existing_ids = get_existing_document_ids(es, index)
            doc_id = generate_smart_doc_id(
                document.get('title', 'untitled'),
                document.get('content', ''),
                existing_ids
            )
            document['id'] = doc_id  # Ensure document has the ID

        # Validate document structure if requested
        if validate_schema:
            try:
                # Check if this looks like a knowledge base document
                if isinstance(document, dict) and "id" in document and "title" in document:
                    validated_doc = validate_document_structure(document)
                    document = validated_doc
                    # Use the document ID from the validated document if not provided earlier
                    if not doc_id:
                        doc_id = document.get("id")
                else:
                    # For non-knowledge base documents, still validate with strict mode if enabled
                    validated_doc = validate_document_structure(document, is_knowledge_doc=False)
                    document = validated_doc
            except DocumentValidationError as e:
                return f"āŒ Validation failed:\n\n{format_validation_error(e)}"
            except Exception as e:
                return f"āŒ Validation error: {str(e)}"

        # Index the document
        result = es.index(index=index, id=doc_id, body=document)
        success_message = f"āœ… Document indexed successfully:\n\n{json.dumps(result, indent=2, ensure_ascii=False)}"

        # Add smart guidance based on indexing result
        if result.get('result') == 'created':
            success_message += f"\n\nšŸŽ‰ **New Document Created**:\n"
            success_message += f"   šŸ“„ **Document ID**: {doc_id}\n"
            success_message += f"   šŸ†” **ID Strategy**: {'User-provided' if 'doc_id' in locals() and doc_id else 'Smart-generated'}\n"
            if check_duplicates:
                success_message += f"   āœ… **Duplicate Check**: Passed - no similar titles found\n"
        else:
            success_message += f"\n\nšŸ”„ **Document Updated**:\n"
            success_message += f"   šŸ“„ **Document ID**: {doc_id}\n"
            success_message += f"   ⚔ **Action**: Replaced existing document with same ID\n"

        success_message += (
            f"\n\nšŸ’” **Smart Duplicate Prevention Active**:\n"
            + f"   šŸ” **Auto-Check**: {'Enabled' if check_duplicates else 'Disabled'} - searches for similar titles\n"
            + f"   šŸ¤– **AI Analysis**: {'Enabled' if use_ai_similarity else 'Disabled'} - intelligent content similarity detection\n"
            + f"   šŸ†” **Smart IDs**: Auto-generated from title with collision detection\n"
            + f"   ⚔ **Force Option**: Use force_index=True to bypass duplicate warnings\n"
            + f"   šŸ”„ **Update Recommended**: Modify existing documents instead of creating duplicates\n\n"
            + f"šŸ¤ **Best Practices**:\n"
            + f"   • Search before creating: 'search(index=\"{index}\", query=\"your topic\")'\n"
            + f"   • Update existing documents when possible\n"
            + f"   • Use descriptive titles for better smart ID generation\n"
            + f"   • AI will analyze content similarity for intelligent recommendations\n"
            + f"   • Set force_index=True only when content is truly unique")

        return success_message

    except Exception as e:
        # Provide detailed error messages for different types of Elasticsearch errors
        error_message = "āŒ Document indexing failed:\n\n"
        error_str = str(e).lower()

        if "connection" in error_str or "refused" in error_str:
            error_message += "šŸ”Œ **Connection Error**: Cannot connect to Elasticsearch server\n"
            error_message += f"šŸ“ Check if Elasticsearch is running at the configured address\n"
            error_message += f"šŸ’” Try: Use 'setup_elasticsearch' tool to start Elasticsearch\n\n"
        elif ("index" in error_str and "not found" in error_str) or "index_not_found_exception" in error_str:
            error_message += f"šŸ“ **Index Error**: Index '{index}' does not exist\n"
            error_message += f"šŸ“ The target index has not been created yet\n"
            error_message += f"šŸ’” **Suggestions for agents**:\n"
            error_message += f"   1. Use 'create_index' tool to create the index first\n"
            error_message += f"   2. Use 'list_indices' to see available indices\n"
            error_message += f"   3. Check the correct index name for your data type\n\n"
        elif "mapping" in error_str or "field" in error_str:
            error_message += f"šŸ—‚ļø **Mapping Error**: Document structure conflicts with index mapping\n"
            error_message += f"šŸ“ Document fields don't match the expected index schema\n"
            error_message += f"šŸ’” Try: Adjust document structure or update index mapping\n\n"
        elif "version" in error_str or "conflict" in error_str:
            error_message += f"⚔ **Version Conflict**: Document already exists with different version\n"
            error_message += f"šŸ“ Another process modified this document simultaneously\n"
            error_message += f"šŸ’” Try: Use 'get_document' first, then update with latest version\n\n"
        elif "timeout" in error_str:
            error_message += "ā±ļø **Timeout Error**: Indexing operation timed out\n"
            error_message += f"šŸ“ Document may be too large or index overloaded\n"
            error_message += f"šŸ’” Try: Reduce document size or retry later\n\n"
        else:
            error_message += f"āš ļø **Unknown Error**: {str(e)}\n\n"

        error_message += f"šŸ” **Technical Details**: {str(e)}"
        return error_message
```
  • Document schema validation function validate_document_structure, which validates against a schema loaded from config.json. It enforces required fields, data types, priority/source_type enums, ID format, timestamp format, non-empty lists/strings, and strict-mode options. Called by index_document when validate_schema=True.
```python
def validate_document_structure(document: Dict[str, Any], base_directory: str = None, is_knowledge_doc: bool = True) -> Dict[str, Any]:
    """
    Validate document structure against schema with strict mode support.

    Args:
        document: Document to validate
        base_directory: Base directory for relative path conversion
        is_knowledge_doc: Whether this is a knowledge base document (default: True)

    Returns:
        Validated and normalized document

    Raises:
        DocumentValidationError: If validation fails
    """
    errors = []
    validation_config = load_validation_config()
    document_schema = load_document_schema()

    # For knowledge base documents, check the full schema
    if is_knowledge_doc:
        # Check for extra fields if strict validation is enabled
        if validation_config.get("strict_schema_validation", False) and not validation_config.get("allow_extra_fields", True):
            allowed_fields = set(document_schema["required_fields"])
            document_fields = set(document.keys())
            extra_fields = document_fields - allowed_fields
            if extra_fields:
                errors.append(f"Extra fields not allowed in strict mode: {', '.join(sorted(extra_fields))}. Allowed fields: {', '.join(sorted(allowed_fields))}")
    else:
        # For non-knowledge documents, only check for extra fields if strict validation is enabled
        if validation_config.get("strict_schema_validation", False) and not validation_config.get("allow_extra_fields", True):
            # For non-knowledge docs, we don't have a predefined schema, so just enforce no extra fields beyond basic ones
            # This is a more lenient check - you might want to customize this based on your needs
            errors.append("Strict schema validation is enabled. Extra fields are not allowed for custom documents.")

    # Check required fields only for knowledge base documents
    if is_knowledge_doc:
        required_fields = document_schema["required_fields"]
        if validation_config.get("required_fields_only", False):
            # Only check fields that are actually required
            for field in required_fields:
                if field not in document:
                    errors.append(f"Missing required field: {field}")
        else:
            # Check all fields in schema
            for field in required_fields:
                if field not in document:
                    errors.append(f"Missing required field: {field}")

    if errors:
        raise DocumentValidationError("Validation failed: " + "; ".join(errors))

    # For knowledge base documents, perform detailed validation
    if is_knowledge_doc:
        # Validate field types
        for field, expected_type in document_schema["field_types"].items():
            if field in document:
                if not isinstance(document[field], expected_type):
                    errors.append(f"Field '{field}' must be of type {expected_type.__name__}, got {type(document[field]).__name__}")

        # NEW: Validate content length
        if document.get("content"):
            content = document["content"]
            # Check for empty content
            if not content.strip():
                errors.append("Content cannot be empty or contain only whitespace")

        # Validate priority values
        if document.get("priority") not in document_schema["priority_values"]:
            errors.append(f"Priority must be one of {document_schema['priority_values']}, got '{document.get('priority')}'")

        # Validate source_type
        if document.get("source_type") not in document_schema["source_types"]:
            errors.append(f"Source type must be one of {document_schema['source_types']}, got '{document.get('source_type')}'")

        # Validate ID format (should be alphanumeric with hyphens)
        if document.get("id") and not re.match(r'^[a-zA-Z0-9-_]+$', document["id"]):
            errors.append("ID must contain only alphanumeric characters, hyphens, and underscores")

        # Validate timestamp format
        if document.get("last_modified"):
            try:
                datetime.fromisoformat(document["last_modified"].replace('Z', '+00:00'))
            except ValueError:
                errors.append("last_modified must be in ISO 8601 format (e.g., '2025-01-04T10:30:00Z')")

        # Validate tags (must be non-empty strings)
        if document.get("tags"):
            for i, tag in enumerate(document["tags"]):
                if not isinstance(tag, str) or not tag.strip():
                    errors.append(f"Tag at index {i} must be a non-empty string")

        # Validate related documents (must be strings)
        if document.get("related"):
            for i, related_id in enumerate(document["related"]):
                if not isinstance(related_id, str) or not related_id.strip():
                    errors.append(f"Related document ID at index {i} must be a non-empty string")

        # Validate key_points (must be non-empty strings)
        if document.get("key_points"):
            for i, point in enumerate(document["key_points"]):
                if not isinstance(point, str) or not point.strip():
                    errors.append(f"Key point at index {i} must be a non-empty string")

    if errors:
        raise DocumentValidationError("Validation failed: " + "; ".join(errors))

    return document
```
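The settings this function reads via load_validation_config() live in config.json. A minimal sketch of how they might look; the wrapping "document_validation" section name and layout are assumptions, and only the three keys are taken from the code above:

```json
{
  "document_validation": {
    "strict_schema_validation": false,
    "allow_extra_fields": true,
    "required_fields_only": false
  }
}
```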
  • Mounting of the elasticsearch_document sub-server app (containing index_document) into the unified elasticsearch_server app at line 48: app.mount(document_app). Makes the tool available in the elasticsearch namespace.
```python
from .sub_servers.elasticsearch_document import app as document_app
from .sub_servers.elasticsearch_index import app as index_app
from .sub_servers.elasticsearch_search import app as search_app
from .sub_servers.elasticsearch_batch import app as batch_app

# Create unified FastMCP application
app = FastMCP(
    name="AgentKnowledgeMCP-Elasticsearch",
    version="2.0.0",
    instructions="Unified Elasticsearch tools for comprehensive knowledge management via modular server mounting"
)

# ================================
# SERVER MOUNTING - MODULAR ARCHITECTURE
# ================================
print("šŸ—ļø Mounting Elasticsearch sub-servers...")

# Mount all sub-servers into unified interface
app.mount(snapshots_app)       # 3 tools: snapshot management
app.mount(index_metadata_app)  # 3 tools: metadata governance
app.mount(document_app)        # 3 tools: document operations
```
  • Mounting of the elasticsearch_server app (including index_document) into the main FastMCP server app: app.mount(elasticsearch_server_app). Exposes the tool globally without prefix for backward compatibility.
```python
app.mount(elasticsearch_server_app)
```
  • AI-powered duplicate/content similarity checker check_content_similarity_with_ai used by index_document. Performs Elasticsearch similarity search followed by LLM analysis to recommend CREATE/UPDATE/DELETE/MERGE actions with confidence scores and reasoning.
```python
async def check_content_similarity_with_ai(es, index: str, title: str, content: str, ctx: Context, similarity_threshold: float = 0.7) -> dict:
    """
    Advanced content similarity checking using AI analysis.
    Returns recommendations for UPDATE, DELETE, CREATE, or MERGE actions.
    """
    try:
        # First, find potentially similar documents using Elasticsearch
        similar_docs = []

        # Search for documents with similar titles or content
        if len(content) > 100:
            search_query = {
                "query": {
                    "bool": {
                        "should": [
                            {"match": {"title": {"query": title, "boost": 3.0}}},
                            {"match": {"content": {"query": content[:500], "boost": 1.0}}},
                            {"more_like_this": {
                                "fields": ["content", "title"],
                                "like": content[:1000],
                                "min_term_freq": 1,
                                "max_query_terms": 8,
                                "minimum_should_match": "30%"
                            }}
                        ]
                    }
                },
                "size": 5,
                "_source": ["title", "summary", "content", "last_modified", "id"]
            }

            result = es.search(index=index, body=search_query)

            # Collect similar documents
            for hit in result['hits']['hits']:
                source = hit['_source']
                similar_docs.append({
                    "id": hit['_id'],
                    "title": source.get('title', ''),
                    "summary": source.get('summary', '')[:200] + "..." if len(source.get('summary', '')) > 200 else source.get('summary', ''),
                    "content_preview": source.get('content', '')[:300] + "..." if len(source.get('content', '')) > 300 else source.get('content', ''),
                    "last_modified": source.get('last_modified', ''),
                    "elasticsearch_score": hit['_score']
                })

        # If no similar documents found, recommend CREATE
        if not similar_docs:
            return {
                "action": "CREATE",
                "confidence": 0.95,
                "reason": "No similar content found in knowledge base",
                "similar_docs": [],
                "ai_analysis": "Content appears to be unique and should be created as new document"
            }

        # Use AI to analyze content similarity and recommend action
        ai_prompt = f"""You are an intelligent duplicate detection system. Analyze the new document against existing similar documents and recommend the best action.

NEW DOCUMENT:
Title: {title}
Content: {content[:1500]}{"..." if len(content) > 1500 else ""}

EXISTING SIMILAR DOCUMENTS:
"""
        for i, doc in enumerate(similar_docs[:3], 1):
            ai_prompt += f"""
Document {i}: {doc['title']} (ID: {doc['id']})
Summary: {doc['summary']}
Content Preview: {doc['content_preview']}
Last Modified: {doc['last_modified']}
---"""

        ai_prompt += f"""

Please analyze and provide:
1. Content similarity percentage (0-100%) for each existing document
2. Recommended action: UPDATE, DELETE, CREATE, or MERGE
3. Detailed reasoning for your recommendation
4. Which specific document to update/merge with (if applicable)

Guidelines:
- UPDATE: If new content is an improved version of existing content (>70% similar)
- DELETE: If existing content is clearly superior and new content adds no value (>85% similar)
- MERGE: If both contents have valuable unique information (40-70% similar)
- CREATE: If content is sufficiently different and valuable (<40% similar)

Respond in JSON format:
{{
    "similarity_scores": [85, 60, 20],
    "recommended_action": "UPDATE|DELETE|CREATE|MERGE",
    "confidence": 0.85,
    "target_document_id": "doc-id-if-update-or-merge",
    "reasoning": "Detailed explanation of why this action is recommended",
    "merge_strategy": "How to combine documents if MERGE is recommended"
}}

Consider:
- Content quality and completeness
- Information uniqueness and value
- Documentation freshness and accuracy
- Knowledge base organization"""

        # Get AI analysis
        response = await ctx.sample(
            messages=ai_prompt,
            system_prompt="You are an expert knowledge management AI. Analyze content similarity and recommend the optimal action to maintain a high-quality, organized knowledge base. Always respond with valid JSON.",
            model_preferences=["claude-3-opus", "claude-3-sonnet", "gpt-4"],
            temperature=0.3,
            max_tokens=600
        )

        # Parse AI response
        ai_analysis = json.loads(response.text.strip())

        # Add similar documents to response
        ai_analysis["similar_docs"] = similar_docs
        ai_analysis["ai_analysis"] = response.text

        return ai_analysis

    except Exception as e:
        # Fallback to simple duplicate check if AI analysis fails
        return {
            "action": "CREATE",
            "confidence": 0.6,
            "reason": f"AI analysis failed ({str(e)}), defaulting to CREATE",
            "similar_docs": similar_docs if 'similar_docs' in locals() else [],
            "ai_analysis": f"Error during AI analysis: {str(e)}"
        }
```
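For reference, index_document consumes the returned dictionary through ai_analysis.get('action'), .get('confidence'), .get('reasoning'), .get('target_document_id'), .get('merge_strategy'), and .get('similar_docs'). A sketch of a plausible result is shown below; all values are illustrative, not real output:

```json
{
  "action": "UPDATE",
  "confidence": 0.85,
  "reasoning": "The new document largely overlaps an existing guide but adds newer configuration steps.",
  "target_document_id": "elasticsearch-indexing-guide",
  "similar_docs": [
    {
      "id": "elasticsearch-indexing-guide",
      "title": "Elasticsearch Indexing Guide",
      "summary": "How documents are indexed into the knowledge base...",
      "content_preview": "Documents are indexed via the index_document tool...",
      "last_modified": "2025-01-04T10:30:00Z",
      "elasticsearch_score": 12.4
    }
  ],
  "ai_analysis": "<raw model response text>"
}
```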

MCP directory API

We provide all information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/itshare4u/AgentKnowledgeMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.