Skip to main content
Glama
itshare4u

Agent Knowledge MCP

index_document

Add documents to Elasticsearch with duplicate detection and automatic ID generation to organize knowledge base content.

Instructions

Index a document into Elasticsearch with smart duplicate prevention and intelligent document ID generation. πŸ’‘ RECOMMENDED: Use 'create_document_template' tool first to generate a proper document structure and avoid validation errors.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
indexYesName of the Elasticsearch index to store the document
documentYesDocument data to index as JSON object. πŸ’‘ RECOMMENDED: Use 'create_document_template' tool first to generate proper document format.
doc_idNoOptional document ID - if not provided, smart ID will be generated
validate_schemaNoWhether to validate document structure for knowledge base format
check_duplicatesNoCheck for existing documents with similar title before indexing
force_indexNoForce indexing even if potential duplicates are found. πŸ’‘ TIP: Set to True if content is genuinely new and not in knowledge base to avoid multiple tool calls
use_ai_similarityNoUse AI to analyze content similarity and provide intelligent recommendations

Implementation Reference

  • Core implementation of the index_document tool. Handles Elasticsearch indexing with advanced features: smart duplicate detection via title matching, AI content similarity analysis, optional schema validation, auto-generated document IDs with collision avoidance, force indexing option, and comprehensive user-friendly error messages with actionable suggestions.
    @app.tool(
        description="Index a document into Elasticsearch with smart duplicate prevention and intelligent document ID generation. πŸ’‘ RECOMMENDED: Use 'create_document_template' tool first to generate a proper document structure and avoid validation errors.",
        tags={"elasticsearch", "index", "document", "validation", "duplicate-prevention"}
    )
    async def index_document(
            index: Annotated[str, Field(description="Name of the Elasticsearch index to store the document")],
            document: Annotated[Dict[str, Any], Field(description="Document data to index as JSON object. πŸ’‘ RECOMMENDED: Use 'create_document_template' tool first to generate proper document format.")],
            doc_id: Annotated[Optional[str], Field(
                description="Optional document ID - if not provided, smart ID will be generated")] = None,
            validate_schema: Annotated[
                bool, Field(description="Whether to validate document structure for knowledge base format")] = True,
            check_duplicates: Annotated[
                bool, Field(description="Check for existing documents with similar title before indexing")] = True,
            force_index: Annotated[
                bool, Field(description="Force indexing even if potential duplicates are found. πŸ’‘ TIP: Set to True if content is genuinely new and not in knowledge base to avoid multiple tool calls")] = False,
            use_ai_similarity: Annotated[bool, Field(
                description="Use AI to analyze content similarity and provide intelligent recommendations")] = True,
            ctx: Context = None
    ) -> str:
        """Index a document into Elasticsearch with smart duplicate prevention."""
        try:
            es = get_es_client()
    
            # Smart duplicate checking if enabled
            if check_duplicates and not force_index:
                title = document.get('title', '')
                content = document.get('content', '')
    
                if title:
                    # First check simple title duplicates
                    dup_check = check_title_duplicates(es, index, title)
                    if dup_check['found']:
                        duplicates_info = "\n".join([
                            f"   πŸ“„ {dup['title']} (ID: {dup['id']})\n      πŸ“ {dup['summary']}\n      πŸ“… {dup['last_modified']}"
                            for dup in dup_check['duplicates'][:3]
                        ])
    
                        # Use AI similarity analysis if enabled and content is substantial
                        if use_ai_similarity and content and len(content) > 200 and ctx:
                            try:
                                ai_analysis = await check_content_similarity_with_ai(es, index, title, content, ctx)
    
                                action = ai_analysis.get('action', 'CREATE')
                                confidence = ai_analysis.get('confidence', 0.5)
                                reasoning = ai_analysis.get('reasoning', 'AI analysis completed')
                                target_doc = ai_analysis.get('target_document_id', '')
    
                                ai_message = f"\n\nπŸ€– **AI Content Analysis** (Confidence: {confidence:.0%}):\n"
                                ai_message += f"   🎯 **Recommended Action**: {action}\n"
                                ai_message += f"   πŸ’­ **AI Reasoning**: {reasoning}\n"
    
                                if action == "UPDATE" and target_doc:
                                    ai_message += f"   πŸ“„ **Target Document**: {target_doc}\n"
                                    ai_message += f"   πŸ’‘ **Suggestion**: Update existing document instead of creating new one\n"
    
                                elif action == "DELETE":
                                    ai_message += f"   πŸ—‘οΈ **AI Recommendation**: Existing content is superior, consider not creating this document\n"
    
                                elif action == "MERGE" and target_doc:
                                    ai_message += f"   πŸ”„ **Merge Target**: {target_doc}\n"
                                    ai_message += f"   πŸ“ **Strategy**: {ai_analysis.get('merge_strategy', 'Combine unique information from both documents')}\n"
    
                                elif action == "CREATE":
                                    ai_message += f"   βœ… **AI Approval**: Content is sufficiently unique to create new document\n"
                                    # If AI says CREATE, allow automatic indexing
                                    pass
    
                                # Show similar documents found by AI
                                similar_docs = ai_analysis.get('similar_docs', [])
                                if similar_docs:
                                    ai_message += f"\n   πŸ“‹ **Similar Documents Analyzed**:\n"
                                    for i, doc in enumerate(similar_docs[:2], 1):
                                        ai_message += f"      {i}. {doc['title']} (Score: {doc.get('elasticsearch_score', 0):.1f})\n"
    
                                # If AI recommends CREATE with high confidence, proceed automatically
                                if action == "CREATE" and confidence > 0.8:
                                    # Continue with indexing - don't return early
                                    pass
                                else:
                                    # Return AI analysis for user review
                                    return (
                                            f"⚠️ **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n" +
                                            f"{duplicates_info}\n" +
                                            f"{ai_message}\n\n" +
                                            f"πŸ€” **What would you like to do?**\n" +
                                            f"   1️⃣ **FOLLOW AI RECOMMENDATION**: {action} as suggested by AI\n" +
                                            f"   2️⃣ **UPDATE existing document**: Modify one of the above instead\n" +
                                            f"   3️⃣ **SEARCH for more**: Use search tool to find all related content\n" +
                                            f"   4️⃣ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n" +
                                            f"πŸ’‘ **AI Recommendation**: {reasoning}\n" +
                                            f"πŸ” **Next Step**: Search for '{title}' to see all related documents\n\n" +
                                            f"⚑ **To force indexing**: Call again with force_index=True")
    
                            except Exception as ai_error:
                                # Fallback to simple duplicate check if AI fails
                                return (
                                        f"⚠️ **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n" +
                                        f"{duplicates_info}\n\n" +
                                        f"⚠️ **AI Analysis Failed**: {str(ai_error)}\n\n" +
                                        f"πŸ€” **What would you like to do?**\n" +
                                        f"   1️⃣ **UPDATE existing document**: Modify one of the above instead\n" +
                                        f"   2️⃣ **SEARCH for more**: Use search tool to find all related content\n" +
                                        f"   3️⃣ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n" +
                                        f"πŸ’‘ **Recommendation**: Update existing documents to prevent knowledge base bloat\n" +
                                        f"πŸ” **Next Step**: Search for '{title}' to see all related documents\n\n" +
                                        f"⚑ **To force indexing**: Call again with force_index=True")
    
                        else:
                            # Simple duplicate check without AI
                            return (f"⚠️ **Potential Duplicates Found** - {dup_check['count']} similar document(s):\n\n" +
                                    f"{duplicates_info}\n\n" +
                                    f"πŸ€” **What would you like to do?**\n" +
                                    f"   1️⃣ **UPDATE existing document**: Modify one of the above instead\n" +
                                    f"   2️⃣ **SEARCH for more**: Use search tool to find all related content\n" +
                                    f"   3️⃣ **FORCE CREATE anyway**: Set force_index=True if this is truly unique\n\n" +
                                    f"πŸ’‘ **Recommendation**: Update existing documents to prevent knowledge base bloat\n" +
                                    f"πŸ” **Next Step**: Search for '{title}' to see all related documents\n\n" +
                                    f"⚑ **To force indexing**: Call again with force_index=True")
    
            # Generate smart document ID if not provided
            if not doc_id:
                existing_ids = get_existing_document_ids(es, index)
                doc_id = generate_smart_doc_id(
                    document.get('title', 'untitled'),
                    document.get('content', ''),
                    existing_ids
                )
                document['id'] = doc_id  # Ensure document has the ID
    
            # Validate document structure if requested
            if validate_schema:
                try:
                    # Check if this looks like a knowledge base document
                    if isinstance(document, dict) and "id" in document and "title" in document:
                        validated_doc = validate_document_structure(document)
                        document = validated_doc
    
                        # Use the document ID from the validated document if not provided earlier
                        if not doc_id:
                            doc_id = document.get("id")
    
                    else:
                        # For non-knowledge base documents, still validate with strict mode if enabled
                        validated_doc = validate_document_structure(document, is_knowledge_doc=False)
                        document = validated_doc
                except DocumentValidationError as e:
                    return f"❌ Validation failed:\n\n{format_validation_error(e)}"
                except Exception as e:
                    return f"❌ Validation error: {str(e)}"
    
            # Index the document
            result = es.index(index=index, id=doc_id, body=document)
    
            success_message = f"βœ… Document indexed successfully:\n\n{json.dumps(result, indent=2, ensure_ascii=False)}"
    
            # Add smart guidance based on indexing result
            if result.get('result') == 'created':
                success_message += f"\n\nπŸŽ‰ **New Document Created**:\n"
                success_message += f"   πŸ“„ **Document ID**: {doc_id}\n"
                success_message += f"   πŸ†” **ID Strategy**: {'User-provided' if 'doc_id' in locals() and doc_id else 'Smart-generated'}\n"
                if check_duplicates:
                    success_message += f"   βœ… **Duplicate Check**: Passed - no similar titles found\n"
            else:
                success_message += f"\n\nπŸ”„ **Document Updated**:\n"
                success_message += f"   πŸ“„ **Document ID**: {doc_id}\n"
                success_message += f"   ⚑ **Action**: Replaced existing document with same ID\n"
    
            success_message += (f"\n\nπŸ’‘ **Smart Duplicate Prevention Active**:\n" +
                                f"   πŸ” **Auto-Check**: {'Enabled' if check_duplicates else 'Disabled'} - searches for similar titles\n" +
                                f"   πŸ€– **AI Analysis**: {'Enabled' if use_ai_similarity else 'Disabled'} - intelligent content similarity detection\n" +
                                f"   πŸ†” **Smart IDs**: Auto-generated from title with collision detection\n" +
                                f"   ⚑ **Force Option**: Use force_index=True to bypass duplicate warnings\n" +
                                f"   πŸ”„ **Update Recommended**: Modify existing documents instead of creating duplicates\n\n" +
                                f"🀝 **Best Practices**:\n" +
                                f"   β€’ Search before creating: 'search(index=\"{index}\", query=\"your topic\")'\n" +
                                f"   β€’ Update existing documents when possible\n" +
                                f"   β€’ Use descriptive titles for better smart ID generation\n" +
                                f"   β€’ AI will analyze content similarity for intelligent recommendations\n" +
                                f"   β€’ Set force_index=True only when content is truly unique")
    
            return success_message
    
        except Exception as e:
            # Provide detailed error messages for different types of Elasticsearch errors
            error_message = "❌ Document indexing failed:\n\n"
    
            error_str = str(e).lower()
            if "connection" in error_str or "refused" in error_str:
                error_message += "πŸ”Œ **Connection Error**: Cannot connect to Elasticsearch server\n"
                error_message += f"πŸ“ Check if Elasticsearch is running at the configured address\n"
                error_message += f"πŸ’‘ Try: Use 'setup_elasticsearch' tool to start Elasticsearch\n\n"
            elif ("index" in error_str and "not found" in error_str) or "index_not_found_exception" in error_str:
                error_message += f"πŸ“ **Index Error**: Index '{index}' does not exist\n"
                error_message += f"πŸ“ The target index has not been created yet\n"
                error_message += f"πŸ’‘ **Suggestions for agents**:\n"
                error_message += f"   1. Use 'create_index' tool to create the index first\n"
                error_message += f"   2. Use 'list_indices' to see available indices\n"
                error_message += f"   3. Check the correct index name for your data type\n\n"
            elif "mapping" in error_str or "field" in error_str:
                error_message += f"πŸ—‚οΈ **Mapping Error**: Document structure conflicts with index mapping\n"
                error_message += f"πŸ“ Document fields don't match the expected index schema\n"
                error_message += f"πŸ’‘ Try: Adjust document structure or update index mapping\n\n"
            elif "version" in error_str or "conflict" in error_str:
                error_message += f"⚑ **Version Conflict**: Document already exists with different version\n"
                error_message += f"πŸ“ Another process modified this document simultaneously\n"
                error_message += f"πŸ’‘ Try: Use 'get_document' first, then update with latest version\n\n"
            elif "timeout" in error_str:
                error_message += "⏱️ **Timeout Error**: Indexing operation timed out\n"
                error_message += f"πŸ“ Document may be too large or index overloaded\n"
                error_message += f"πŸ’‘ Try: Reduce document size or retry later\n\n"
            else:
                error_message += f"⚠️ **Unknown Error**: {str(e)}\n\n"
    
            error_message += f"πŸ” **Technical Details**: {str(e)}"
    
            return error_message
  • Document schema validation function validate_document_structure, loaded from config.json. Enforces required fields, data types, priority/source_type enums, ID format, timestamp format, non-empty lists/strings, and strict mode options. Called by index_document when validate_schema=True.
    def validate_document_structure(document: Dict[str, Any], base_directory: str = None, is_knowledge_doc: bool = True) -> Dict[str, Any]:
        """
        Validate document structure against schema with strict mode support.
        
        Args:
            document: Document to validate
            base_directory: Base directory for relative path conversion
            is_knowledge_doc: Whether this is a knowledge base document (default: True)
            
        Returns:
            Validated and normalized document
            
        Raises:
            DocumentValidationError: If validation fails
        """
        errors = []
        validation_config = load_validation_config()
        document_schema = load_document_schema()
        
        # For knowledge base documents, check the full schema
        if is_knowledge_doc:
            # Check for extra fields if strict validation is enabled
            if validation_config.get("strict_schema_validation", False) and not validation_config.get("allow_extra_fields", True):
                allowed_fields = set(document_schema["required_fields"])
                document_fields = set(document.keys())
                extra_fields = document_fields - allowed_fields
                
                if extra_fields:
                    errors.append(f"Extra fields not allowed in strict mode: {', '.join(sorted(extra_fields))}. Allowed fields: {', '.join(sorted(allowed_fields))}")
        else:
            # For non-knowledge documents, only check for extra fields if strict validation is enabled
            if validation_config.get("strict_schema_validation", False) and not validation_config.get("allow_extra_fields", True):
                # For non-knowledge docs, we don't have a predefined schema, so just enforce no extra fields beyond basic ones
                # This is a more lenient check - you might want to customize this based on your needs
                errors.append("Strict schema validation is enabled. Extra fields are not allowed for custom documents.")
        
        # Check required fields only for knowledge base documents
        if is_knowledge_doc:
            required_fields = document_schema["required_fields"]
            if validation_config.get("required_fields_only", False):
                # Only check fields that are actually required
                for field in required_fields:
                    if field not in document:
                        errors.append(f"Missing required field: {field}")
            else:
                # Check all fields in schema
                for field in required_fields:
                    if field not in document:
                        errors.append(f"Missing required field: {field}")
        
        if errors:
            raise DocumentValidationError("Validation failed: " + "; ".join(errors))
        
        # For knowledge base documents, perform detailed validation
        if is_knowledge_doc:
            # Validate field types
            for field, expected_type in document_schema["field_types"].items():
                if field in document:
                    if not isinstance(document[field], expected_type):
                        errors.append(f"Field '{field}' must be of type {expected_type.__name__}, got {type(document[field]).__name__}")
            
            # NEW: Validate content length
            if document.get("content"):
                content = document["content"]
                
                # Check for empty content
                if not content.strip():
                    errors.append("Content cannot be empty or contain only whitespace")
            
            # Validate priority values
            if document.get("priority") not in document_schema["priority_values"]:
                errors.append(f"Priority must be one of {document_schema['priority_values']}, got '{document.get('priority')}'")
            
            # Validate source_type
            if document.get("source_type") not in document_schema["source_types"]:
                errors.append(f"Source type must be one of {document_schema['source_types']}, got '{document.get('source_type')}'")
            
            # Validate ID format (should be alphanumeric with hyphens)
            if document.get("id") and not re.match(r'^[a-zA-Z0-9-_]+$', document["id"]):
                errors.append("ID must contain only alphanumeric characters, hyphens, and underscores")
            
            # Validate timestamp format
            if document.get("last_modified"):
                try:
                    datetime.fromisoformat(document["last_modified"].replace('Z', '+00:00'))
                except ValueError:
                    errors.append("last_modified must be in ISO 8601 format (e.g., '2025-01-04T10:30:00Z')")
            
            # Validate tags (must be non-empty strings)
            if document.get("tags"):
                for i, tag in enumerate(document["tags"]):
                    if not isinstance(tag, str) or not tag.strip():
                        errors.append(f"Tag at index {i} must be a non-empty string")
            
            # Validate related documents (must be strings)
            if document.get("related"):
                for i, related_id in enumerate(document["related"]):
                    if not isinstance(related_id, str) or not related_id.strip():
                        errors.append(f"Related document ID at index {i} must be a non-empty string")
            
            # Validate key_points (must be non-empty strings)
            if document.get("key_points"):
                for i, point in enumerate(document["key_points"]):
                    if not isinstance(point, str) or not point.strip():
                        errors.append(f"Key point at index {i} must be a non-empty string")
        
        if errors:
            raise DocumentValidationError("Validation failed: " + "; ".join(errors))
        
        return document
  • Mounting of the elasticsearch_document sub-server app (containing index_document) into the unified elasticsearch_server app at line 48: app.mount(document_app). Makes the tool available in the elasticsearch namespace.
    from .sub_servers.elasticsearch_document import app as document_app
    from .sub_servers.elasticsearch_index import app as index_app
    from .sub_servers.elasticsearch_search import app as search_app
    from .sub_servers.elasticsearch_batch import app as batch_app
    
    # Create unified FastMCP application
    app = FastMCP(
        name="AgentKnowledgeMCP-Elasticsearch",
        version="2.0.0",
        instructions="Unified Elasticsearch tools for comprehensive knowledge management via modular server mounting"
    )
    
    # ================================
    # SERVER MOUNTING - MODULAR ARCHITECTURE
    # ================================
    
    print("πŸ—οΈ Mounting Elasticsearch sub-servers...")
    
    # Mount all sub-servers into unified interface
    app.mount(snapshots_app)           # 3 tools: snapshot management
    app.mount(index_metadata_app)      # 3 tools: metadata governance  
    app.mount(document_app)            # 3 tools: document operations
  • Mounting of the elasticsearch_server app (including index_document) into the main FastMCP server app: app.mount(elasticsearch_server_app). Exposes the tool globally without prefix for backward compatibility.
    app.mount(elasticsearch_server_app)
  • AI-powered duplicate/content similarity checker check_content_similarity_with_ai used by index_document. Performs Elasticsearch similarity search followed by LLM analysis to recommend CREATE/UPDATE/DELETE/MERGE actions with confidence scores and reasoning.
    async def check_content_similarity_with_ai(es, index: str, title: str, content: str, ctx: Context, similarity_threshold: float = 0.7) -> dict:
        """
        Advanced content similarity checking using AI analysis.
        Returns recommendations for UPDATE, DELETE, CREATE, or MERGE actions.
        """
        try:
            # First, find potentially similar documents using Elasticsearch
            similar_docs = []
            
            # Search for documents with similar titles or content
            if len(content) > 100:
                search_query = {
                    "query": {
                        "bool": {
                            "should": [
                                {"match": {"title": {"query": title, "boost": 3.0}}},
                                {"match": {"content": {"query": content[:500], "boost": 1.0}}},
                                {"more_like_this": {
                                    "fields": ["content", "title"],
                                    "like": content[:1000],
                                    "min_term_freq": 1,
                                    "max_query_terms": 8,
                                    "minimum_should_match": "30%"
                                }}
                            ]
                        }
                    },
                    "size": 5,
                    "_source": ["title", "summary", "content", "last_modified", "id"]
                }
                
                result = es.search(index=index, body=search_query)
                
                # Collect similar documents
                for hit in result['hits']['hits']:
                    source = hit['_source']
                    similar_docs.append({
                        "id": hit['_id'],
                        "title": source.get('title', ''),
                        "summary": source.get('summary', '')[:200] + "..." if len(source.get('summary', '')) > 200 else source.get('summary', ''),
                        "content_preview": source.get('content', '')[:300] + "..." if len(source.get('content', '')) > 300 else source.get('content', ''),
                        "last_modified": source.get('last_modified', ''),
                        "elasticsearch_score": hit['_score']
                    })
            
            # If no similar documents found, recommend CREATE
            if not similar_docs:
                return {
                    "action": "CREATE",
                    "confidence": 0.95,
                    "reason": "No similar content found in knowledge base",
                    "similar_docs": [],
                    "ai_analysis": "Content appears to be unique and should be created as new document"
                }
            
            # Use AI to analyze content similarity and recommend action
            ai_prompt = f"""You are an intelligent duplicate detection system. Analyze the new document against existing similar documents and recommend the best action.
    
    NEW DOCUMENT:
    Title: {title}
    Content: {content[:1500]}{"..." if len(content) > 1500 else ""}
    
    EXISTING SIMILAR DOCUMENTS:
    """
            
            for i, doc in enumerate(similar_docs[:3], 1):
                ai_prompt += f"""
    Document {i}: {doc['title']} (ID: {doc['id']})
    Summary: {doc['summary']}
    Content Preview: {doc['content_preview']}
    Last Modified: {doc['last_modified']}
    ---"""
            
            ai_prompt += f"""
    
    Please analyze and provide:
    1. Content similarity percentage (0-100%) for each existing document
    2. Recommended action: UPDATE, DELETE, CREATE, or MERGE
    3. Detailed reasoning for your recommendation
    4. Which specific document to update/merge with (if applicable)
    
    Guidelines:
    - UPDATE: If new content is an improved version of existing content (>70% similar)
    - DELETE: If existing content is clearly superior and new content adds no value (>85% similar)  
    - MERGE: If both contents have valuable unique information (40-70% similar)
    - CREATE: If content is sufficiently different and valuable (<40% similar)
    
    Respond in JSON format:
    {{
      "similarity_scores": [85, 60, 20],
      "recommended_action": "UPDATE|DELETE|CREATE|MERGE",
      "confidence": 0.85,
      "target_document_id": "doc-id-if-update-or-merge",
      "reasoning": "Detailed explanation of why this action is recommended",
      "merge_strategy": "How to combine documents if MERGE is recommended"
    }}
    
    Consider:
    - Content quality and completeness
    - Information uniqueness and value
    - Documentation freshness and accuracy
    - Knowledge base organization"""
    
            # Get AI analysis
            response = await ctx.sample(
                messages=ai_prompt,
                system_prompt="You are an expert knowledge management AI. Analyze content similarity and recommend the optimal action to maintain a high-quality, organized knowledge base. Always respond with valid JSON.",
                model_preferences=["claude-3-opus", "claude-3-sonnet", "gpt-4"],
                temperature=0.3,
                max_tokens=600
            )
            
            # Parse AI response
            ai_analysis = json.loads(response.text.strip())
            
            # Add similar documents to response
            ai_analysis["similar_docs"] = similar_docs
            ai_analysis["ai_analysis"] = response.text
            
            return ai_analysis
            
        except Exception as e:
            # Fallback to simple duplicate check if AI analysis fails
            return {
                "action": "CREATE",
                "confidence": 0.6,
                "reason": f"AI analysis failed ({str(e)}), defaulting to CREATE",
                "similar_docs": similar_docs if 'similar_docs' in locals() else [],
                "ai_analysis": f"Error during AI analysis: {str(e)}"
            }

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/itshare4u/AgentKnowledgeMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server