# Document Processing Guide
Learn how to add, manage, and optimize documents in the Knowledge RAG system.
## Overview
The document processing pipeline transforms your documents into an intelligent knowledge graph; a toy sketch of the stages follows the list:
1. **Ingestion**: Read document content
2. **Chunking**: Split into semantic chunks
3. **Embedding**: Convert to vector representations
4. **Storage**: Save to Neo4j with vector index
5. **Indexing**: Create search-optimized structures
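The same flow as a runnable toy sketch. Everything here is illustrative: real chunking is token-based, real embeddings come from the configured provider, and storage happens in Neo4j rather than a Python list:

```python
import hashlib

def process_document(content: str, chunk_size: int = 200, overlap: int = 20) -> list[dict]:
    # 2. Chunking: fixed-size character windows with overlap
    step = chunk_size - overlap
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), step)]
    nodes = []
    for idx, chunk in enumerate(chunks):
        # 3. Embedding: a stand-in "vector" derived from a hash
        vector = [b / 255 for b in hashlib.sha256(chunk.encode()).digest()[:8]]
        # 4./5. Storage + indexing: the real pipeline writes the node,
        # its vector, and the index entries to Neo4j
        nodes.append({"chunk_index": idx, "content": chunk, "embedding": vector})
    return nodes

print(len(process_document("lorem ipsum " * 100)))  # 1. Ingestion: raw text in
```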
## Document Processing Methods
### 1. Direct Content (add_document)
Add document content directly as a string.
#### MCP Tool Usage
```json
{
  "tool": "add_document",
  "input": {
    "content": "Your document content here...",
    "title": "Document Title",
    "metadata": {
      "author": "John Doe",
      "category": "tutorial",
      "tags": ["python", "tutorial"]
    }
  }
}
```
#### HTTP API Usage
```bash
curl -X POST http://localhost:8000/api/v1/knowledge/add \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your document content...",
    "title": "Document Title",
    "metadata": {"category": "tutorial"}
  }'
```
#### Python Client Usage
```python
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/v1/knowledge/add",
            json={
                "content": "Document content...",
                "title": "My Document",
                "metadata": {"type": "article"}
            }
        )
        result = response.json()
        print(f"Added: {result}")

asyncio.run(main())
```
#### Size-Based Behavior
- **Small documents** (<10KB): Processed synchronously

  ```json
  {
    "success": true,
    "message": "Document added successfully",
    "node_id": "abc123..."
  }
  ```

- **Large documents** (≥10KB): Queued for async processing

  ```json
  {
    "success": true,
    "async": true,
    "task_id": "task_xyz789",
    "message": "Large document queued (size: 25600 bytes)"
  }
  ```
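Client code shouldn't assume one response shape; it should branch on the `async` flag. A minimal sketch (reusing the httpx pattern from above; field names follow the example responses):

```python
import asyncio
import httpx

async def add_and_dispatch(content: str, title: str) -> None:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "http://localhost:8000/api/v1/knowledge/add",
            json={"content": content, "title": title},
        )
        result = resp.json()
        if result.get("async"):
            # Large document: hold on to the task id and poll later
            print(f"Queued as task {result['task_id']}")
        else:
            # Small document: the node already exists
            print(f"Stored as node {result.get('node_id')}")

asyncio.run(add_and_dispatch("Some content...", "Example"))
```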
### 2. File Upload (add_file)
Process files from the filesystem.
#### Supported File Types
- **Text files**: .txt, .md, .rst, .log
- **Code files**: .py, .js, .java, .go, .rs, .cpp, .c, .h
- **Documentation**: .pdf, .html, .xml
- **Data files**: .json, .yaml, .yml, .toml, .csv
#### MCP Tool Usage
```json
{
  "tool": "add_file",
  "input": {
    "file_path": "/path/to/document.md"
  }
}
```
#### HTTP API Usage
```bash
curl -X POST http://localhost:8000/api/v1/knowledge/add-file \
  -H "Content-Type: application/json" \
  -d '{"file_path": "/path/to/document.md"}'
```
#### Example Response
```json
{
  "success": true,
  "message": "File processed successfully",
  "file_path": "/path/to/document.md",
  "chunks_created": 12,
  "node_id": "file_node_123"
}
```
### 3. Directory Batch Processing (add_directory)
Process multiple files from a directory.
#### MCP Tool Usage
```json
{
  "tool": "add_directory",
  "input": {
    "directory_path": "/path/to/docs",
    "recursive": true
  }
}
```
#### HTTP API Usage
```bash
curl -X POST http://localhost:8000/api/v1/knowledge/add-directory \
  -H "Content-Type: application/json" \
  -d '{
    "directory_path": "/path/to/docs",
    "recursive": true
  }'
```
#### Features
- **Recursive scanning**: Process all subdirectories
- **File filtering**: Automatic filtering by extension
- **Async processing**: Always queued as background task
- **Progress tracking**: Monitor via task queue
#### Example Response
```json
{
  "success": true,
  "async": true,
  "task_id": "dir_task_456",
  "message": "Directory processing queued: /path/to/docs"
}
```
## Task Monitoring
Large documents and directory processing are handled asynchronously. Monitor progress using task queue tools.
### Get Task Status
```json
{
  "tool": "get_task_status",
  "input": {
    "task_id": "task_xyz789"
  }
}
```
### Watch Task Progress
```json
{
  "tool": "watch_task",
  "input": {
    "task_id": "task_xyz789",
    "timeout": 300
  }
}
```
### Task Lifecycle
```
PENDING → PROCESSING → COMPLETED
   ↓           ↓
 FAILED      FAILED
```
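Over HTTP, you can poll until the task reaches a terminal state. A sketch of the loop (the `/api/v1/tasks/{task_id}` route is an assumption for illustration; check your deployment's API reference for the actual endpoint):

```python
import asyncio
import time
import httpx

TERMINAL = {"COMPLETED", "FAILED"}

async def wait_for_task(task_id: str, interval: float = 2.0, timeout: float = 300.0) -> str:
    # NOTE: assumed status endpoint -- verify against your API reference
    url = f"http://localhost:8000/api/v1/tasks/{task_id}"
    deadline = time.monotonic() + timeout
    async with httpx.AsyncClient() as client:
        while time.monotonic() < deadline:
            status = (await client.get(url)).json().get("status")
            if status in TERMINAL:
                return status
            await asyncio.sleep(interval)
    raise TimeoutError(f"Task {task_id} not finished after {timeout}s")
```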
## Document Metadata
Metadata enriches documents and enables advanced filtering.
### Standard Metadata Fields
```python
{
    "title": "Document Title",         # Required
    "author": "Author Name",           # Optional
    "created_at": "2024-01-15",        # Auto-generated if not provided
    "updated_at": "2024-01-16",        # Auto-updated
    "category": "tutorial",            # Custom category
    "tags": ["python", "async"],       # List of tags
    "source": "https://example.com",   # Source URL
    "language": "en",                  # Content language
    "version": "1.0",                  # Document version
    "priority": 0.8                    # Relevance priority
}
```
### Custom Metadata
Add any custom fields:
```python
{
    "metadata": {
        "department": "Engineering",
        "project": "Project Alpha",
        "classification": "internal",
        "expires_at": "2025-01-01",
        "custom_field": "custom_value"
    }
}
```
### Metadata Usage
Metadata is stored as node properties and can be:
- Searched in vector queries
- Filtered in graph queries (sketched below)
- Used for relationship inference
- Displayed in query results
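Because metadata fields land on the node as plain properties, graph-side filtering is a direct Cypher match. A sketch via the official Python driver (connection details and the `category` property are assumptions based on the examples above):

```python
from neo4j import GraphDatabase

# Connection details are placeholders; use your deployment's URI/credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def docs_in_category(category: str) -> list[str]:
    records, _, _ = driver.execute_query(
        "MATCH (d:Document) WHERE d.category = $category RETURN d.title AS title",
        category=category,
    )
    return [r["title"] for r in records]
```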
## Chunking Strategy
Documents are split into chunks for optimal processing and retrieval.
### Chunking Parameters
Configure in `.env`:
```bash
CHUNK_SIZE=512 # Tokens per chunk
CHUNK_OVERLAP=50 # Overlap between chunks (tokens)
```
### Chunk Size Guidelines
| Document Type | Recommended Chunk Size (tokens) | Reasoning |
|---------------|---------------------------------|-----------|
| Technical docs | 512-1024 | Preserve code context |
| Articles | 256-512 | Natural paragraph breaks |
| Code files | 1024-2048 | Keep function context |
| Short content | 128-256 | FAQs, short snippets |
### Overlap Benefits
Chunk overlap ensures context preservation; the sketch after the benefits list shows the boundary arithmetic:
```
Chunk 1: [tokens 0-512] with overlap [462-512]
Chunk 2: [tokens 462-974] with overlap [924-974]
Chunk 3: [tokens 924-1436] ...
```
Benefits:
- ✅ Maintains sentence continuity
- ✅ Preserves context across boundaries
- ✅ Improves retrieval accuracy
- ✅ Reduces information loss
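The boundaries above follow directly from the two settings: each chunk starts `CHUNK_SIZE - CHUNK_OVERLAP` tokens after the previous one. A quick sketch of the arithmetic:

```python
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
STEP = CHUNK_SIZE - CHUNK_OVERLAP  # 462 tokens between chunk starts

def chunk_bounds(total_tokens: int) -> list[tuple[int, int]]:
    # (start, end) token offsets for each chunk
    return [(start, min(start + CHUNK_SIZE, total_tokens))
            for start in range(0, total_tokens, STEP)]

print(chunk_bounds(1500))
# [(0, 512), (462, 974), (924, 1436), (1386, 1500)]
```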
## Embedding Generation
### Embedding Providers
Choose from multiple providers:
#### 1. Ollama (Local, Free)
```bash
# .env configuration
EMBEDDING_PROVIDER=ollama
OLLAMA_EMBEDDING_MODEL=nomic-embed-text # 768 dimensions
OLLAMA_BASE_URL=http://localhost:11434
# Available models:
# - nomic-embed-text (768d) - Recommended
# - mxbai-embed-large (1024d) - Higher quality
# - all-minilm (384d) - Faster, smaller
```
#### 2. OpenAI (Cloud, High Quality)
```bash
# .env configuration
EMBEDDING_PROVIDER=openai
OPENAI_EMBEDDING_MODEL=text-embedding-3-small # 1536 dimensions
OPENAI_API_KEY=sk-...
# Available models:
# - text-embedding-3-small (1536d) - Cost-effective
# - text-embedding-3-large (3072d) - Best quality
# - text-embedding-ada-002 (1536d) - Legacy
```
#### 3. Google Gemini (Cloud, Cost-Effective)
```bash
# .env configuration
EMBEDDING_PROVIDER=gemini
GEMINI_EMBEDDING_MODEL=models/embedding-001 # 768 dimensions
GOOGLE_API_KEY=AIza...
```
#### 4. HuggingFace (Local, Customizable)
```bash
# .env configuration
EMBEDDING_PROVIDER=huggingface
HF_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5 # 384 dimensions
# Popular models:
# - BAAI/bge-small-en-v1.5 (384d) - Fast
# - BAAI/bge-base-en-v1.5 (768d) - Balanced
# - BAAI/bge-large-en-v1.5 (1024d) - Best quality
```
### Vector Dimensions
Different models produce vectors of different dimensions:
```bash
# Configure in .env
VECTOR_DIMENSION=768 # Must match your embedding model
# Common dimensions:
# 384 - Small models (fast, less accurate)
# 768 - Medium models (balanced)
# 1024 - Large models (slower, more accurate)
# 1536 - OpenAI models
# 3072 - OpenAI large model
```
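A mismatch between `VECTOR_DIMENSION` and the model's actual output breaks similarity search, so it is worth a sanity check at startup. A minimal sketch (`get_embedding` is a hypothetical stand-in for your provider call):

```python
import os

def check_vector_dimension(get_embedding) -> None:
    # get_embedding: hypothetical callable mapping text -> list[float]
    expected = int(os.environ.get("VECTOR_DIMENSION", "768"))
    actual = len(get_embedding("dimension probe"))
    if actual != expected:
        raise ValueError(
            f"Embedding model returns {actual}-d vectors, "
            f"but VECTOR_DIMENSION={expected}"
        )
```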
### Embedding Performance
| Provider | Speed | Quality | Cost | Privacy |
|----------|-------|---------|------|---------|
| Ollama | Medium | Good | Free | 100% Private |
| OpenAI | Fast | Excellent | $0.02-0.13 per 1M tokens (model-dependent) | Cloud |
| Gemini | Fast | Very Good | Low | Cloud |
| HuggingFace | Varies by model | Varies | Free | 100% Private |
## Neo4j Storage
### Node Structure
Each document chunk becomes a Neo4j node:
```cypher
(:Document {
    id: "doc_123",
    title: "Document Title",
    content: "Chunk content...",
    embedding: [0.123, 0.456, ...],  // Vector embedding
    chunk_index: 0,
    total_chunks: 10,
    metadata: {...},
    created_at: datetime(),
    updated_at: datetime()
})
```
### Relationships
Documents can have relationships:
```cypher
// Document parts
(doc:Document)-[:HAS_CHUNK]->(chunk:DocumentChunk)
// Document references
(doc1:Document)-[:REFERENCES]->(doc2:Document)
// Topic relationships
(doc:Document)-[:ABOUT]->(topic:Topic)
// Source relationships
(doc:Document)-[:FROM_FILE]->(file:File)
```
### Vector Index
Neo4j creates a vector index for fast similarity search:
```cypher
// Created automatically
CREATE VECTOR INDEX document_embeddings
FOR (d:Document)
ON d.embedding
OPTIONS {
    indexConfig: {
        `vector.dimensions`: 768,
        `vector.similarity_function`: 'cosine'
    }
}
```
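Once the index exists, similarity search goes through Neo4j's `db.index.vector.queryNodes` procedure (available in Neo4j 5.11+). A sketch using the official Python driver (connection details are placeholders):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def top_k_similar(query_embedding: list[float], k: int = 5) -> list[tuple[str, float]]:
    records, _, _ = driver.execute_query(
        """
        CALL db.index.vector.queryNodes('document_embeddings', $k, $embedding)
        YIELD node, score
        RETURN node.title AS title, score
        """,
        k=k,
        embedding=query_embedding,
    )
    return [(r["title"], r["score"]) for r in records]
```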
## Document Management
### List Documents
```bash
# HTTP API
curl http://localhost:8000/api/v1/knowledge/documents
```
Response:
```json
{
  "documents": [
    {
      "id": "doc_123",
      "title": "Document Title",
      "chunks": 10,
      "created_at": "2024-01-15T10:00:00Z",
      "metadata": {...}
    }
  ],
  "total": 1
}
}
```
### Get Document Details
```bash
# HTTP API
curl http://localhost:8000/api/v1/knowledge/documents/doc_123
```
### Update Document
```bash
# HTTP API
curl -X PUT http://localhost:8000/api/v1/knowledge/documents/doc_123 \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Updated Title",
    "metadata": {"updated": true}
  }'
```
### Delete Document
```bash
# HTTP API
curl -X DELETE http://localhost:8000/api/v1/knowledge/documents/doc_123
```
**Note**: Deletion removes the document and all its chunks from the graph.
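These endpoints compose naturally into housekeeping scripts. For example, a sketch that deletes every document in a given category (field names follow the list response shown above):

```python
import httpx

BASE = "http://localhost:8000/api/v1/knowledge"

def delete_by_category(category: str) -> int:
    docs = httpx.get(f"{BASE}/documents").json()["documents"]
    deleted = 0
    for doc in docs:
        if doc.get("metadata", {}).get("category") == category:
            httpx.delete(f"{BASE}/documents/{doc['id']}")
            deleted += 1
    return deleted
```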
## Best Practices
### 1. Document Preparation
**Clean your content**:
```python
import re

# Normalize line endings first
content = content.replace("\r\n", "\n")

# Remove special characters if needed
content = re.sub(r"[^\w\s.,!?]", "", content)

# Collapse excessive whitespace (note: this also flattens newlines)
content = " ".join(content.split())
```
### 2. Metadata Strategy
**Use consistent metadata**:
```python
# Good: Consistent structure
metadata = {
    "type": "tutorial",          # Always use "type"
    "difficulty": "beginner",    # Standardized values
    "tags": ["python", "async"]  # Normalized tags
}

# Bad: Inconsistent
metadata = {
    "kind": "tutorial",      # Different field name
    "level": "easy",         # Different values
    "categories": "python"   # Wrong type (string, not list)
}
```
### 3. Batch Processing
**Process large collections efficiently**:
```python
# Good: Use directory processing
add_directory("/docs", recursive=True)

# Avoid: Individual file uploads
for file in files:  # Don't do this for many files
    add_file(file)
```
### 4. Error Handling
**Handle failures gracefully**:
```python
import logging

logger = logging.getLogger(__name__)

try:
    result = await add_document(content, title)  # inside an async function
    if not result["success"]:
        logger.error(f"Failed: {result.get('error')}")
except Exception as e:
    logger.error(f"Exception: {e}")
```
### 5. Resource Management
**Monitor system resources**:
- Check task queue length
- Monitor Neo4j memory usage
- Track embedding generation time
- Watch disk space
## Troubleshooting
### Issue: Document Not Found
**Symptom**: Queries don't return the expected document
**Solutions**:
1. Verify document was added successfully
2. Check embeddings were generated
3. Verify vector index exists
4. Try different query terms
### Issue: Slow Processing
**Symptom**: Documents take a long time to process
**Solutions**:
1. Check chunk size (reduce if too large)
2. Verify embedding provider is responsive
3. Monitor Neo4j performance
4. Use async processing for large docs
### Issue: Poor Search Results
**Symptom**: Queries return irrelevant documents
**Solutions**:
1. Adjust chunk size/overlap
2. Try different embedding model
3. Add more descriptive metadata
4. Use different query mode (hybrid/vector/graph)
### Issue: Out of Memory
**Symptom**: Embedding generation fails
**Solutions**:
1. Reduce batch size
2. Increase system memory
3. Use smaller embedding model
4. Process documents in smaller batches
## Advanced Techniques
### 1. Custom Document Loaders
Create specialized loaders for custom formats:
```python
from llama_index.core import Document

def load_custom_format(file_path):
    # Your custom parsing logic
    content = parse_custom_file(file_path)
    return Document(
        text=content,
        metadata={
            "source": file_path,
            "format": "custom"
        }
    )
```
### 2. Document Versioning
Track document versions:
```python
metadata = {
    "version": "2.0",
    "previous_version": "1.0",
    "change_summary": "Updated API examples",
    "updated_by": "user@example.com"
}
```
### 3. Multi-language Support
Process documents in multiple languages:
```python
# Specify language in metadata
metadata = {
    "language": "es",  # Spanish
    "original_language": "en",
    "translated": True
}

# Use a language-specific embedding model
EMBEDDING_MODEL = "paraphrase-multilingual-mpnet-base-v2"
```
### 4. Incremental Updates
Update specific document chunks:
```python
# Add new version while keeping history
new_content = updated_document_content
metadata = {
    "replaces": "old_doc_id",
    "version": "2.0"
}
add_document(new_content, metadata=metadata)
```
## Performance Optimization
### Embedding Cache
Cache embeddings for repeated content:
```bash
# Configure cache in .env
ENABLE_EMBEDDING_CACHE=true
CACHE_SIZE=10000  # Number of embeddings to cache
```
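Conceptually, the cache is a bounded map from a content hash to its vector, so repeated chunks skip the provider round-trip. An illustrative sketch (not the server's actual implementation; `embed_fn` stands in for your provider call):

```python
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn, max_size: int = 10_000):  # mirrors CACHE_SIZE
        self._embed = embed_fn
        self._store: dict[str, list[float]] = {}
        self._max = max_size

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self._store:
            if len(self._store) >= self._max:
                # Simple FIFO eviction: drop the oldest inserted entry
                self._store.pop(next(iter(self._store)))
            self._store[key] = self._embed(text)
        return self._store[key]
```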
### Batch Processing
Process multiple documents in batches:
```python
# Good: Use directory processing for efficiency
add_directory("/docs", recursive=True)

# Or implement custom batching
def chunks(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

for batch in chunks(documents, batch_size=10):
    process_batch(batch)
```
### Parallel Processing
Enable parallel processing for large collections:
```bash
# Configure in .env
MAX_WORKERS=4 # Parallel document processing threads
```
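On the client side, you can get a similar effect by bounding concurrent uploads with a semaphore. A sketch (worker count mirrors `MAX_WORKERS`; uses the add endpoint shown earlier):

```python
import asyncio
import httpx

async def ingest_all(docs: list[dict], max_workers: int = 4) -> None:
    sem = asyncio.Semaphore(max_workers)  # mirrors MAX_WORKERS
    async with httpx.AsyncClient() as client:

        async def ingest(doc: dict) -> None:
            async with sem:
                await client.post(
                    "http://localhost:8000/api/v1/knowledge/add", json=doc
                )

        await asyncio.gather(*(ingest(d) for d in docs))

# asyncio.run(ingest_all([{"content": "...", "title": "Doc 1"}]))
```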
## Next Steps
- **[Query Guide](query.md)**: Learn to query your knowledge base effectively
- **[MCP Integration](../mcp/overview.md)**: Connect to AI assistants
## Additional Resources
- **LlamaIndex Documentation**: https://docs.llamaindex.ai/
- **Neo4j Vector Search**: https://neo4j.com/docs/cypher-manual/current/indexes-for-vector-search/
- **Embedding Models**: https://huggingface.co/spaces/mteb/leaderboard