Confluence Hybrid RAG MCP Server
Provides tools for searching Confluence pages, listing spaces, and retrieving full page content, enabling AI agents to query and extract information from a Confluence instance.
Confluence Hybrid RAG + Agentic RAG + MCP + PROD Design
This project implements hybrid agentic RAG over Confluence using Python, bm25s, OpenAI embeddings, RRF, Cohere reranker, Pydantic-AI, and FastMCP. An MCP server and Streamlit chatbot are implemented as core functionalities, enabling Confluence search from Claude Desktop, Cursor, or Claude Code.
Quick reference — choose your retrieval approach
Pick your approach based on corpus size and the type of questions your users ask. Details, cost estimates, and AWS implementation for each option are in the Production architecture section below.
By corpus size
Confluence pages | Recommended approach | Storage needed |
< 50 000 | Contextual BM25 + reranker | OpenSearch only |
50 000 – 300 000 | Contextual BM25 + reranker | OpenSearch only |
300 000 – 1 000 000 | Contextual hybrid (BM25 + dense) + reranker | OpenSearch + vector index |
> 1 000 000 | Contextual hybrid + reranker + GraphRAG | OpenSearch + vector index + Neptune |
By question type
Questions your users ask | Best approach |
"How do I do X?" — specific how-to | Contextual BM25 + reranker |
"What is X?" — definitions, policies | Contextual BM25 + reranker |
"Find anything about X" — paraphrased, exploratory | Contextual hybrid (BM25 + dense) |
"What changed in X and why?" — history, causality | GraphRAG |
"Who owns X and what is its current status?" — ownership chains | GraphRAG |
All of the above | Contextual hybrid + GraphRAG |
All options at a glance
Approach | Quality | Ops complexity | AWS cost / month (medium scale) | Removes vector DB? |
Prototype — BM25 + dense + RRF + rerank | Good | Low | ~$1 335 | No |
Vector DB upgrade — Qdrant / OpenSearch kNN | Good | Medium | ~$1 400 | No |
Contextual BM25 — Claude context prefix + BM25 + rerank | Very good | Low | ~$1 000 | Yes |
Contextual hybrid — context prefix + BM25 + dense + rerank | Excellent | Medium | ~$1 200 | No |
GraphRAG — knowledge graph traversal | Excellent on multi-hop | High | ~$1 720 | Optional |
Contextual hybrid + GraphRAG | Best across all query types | Very high | ~$1 900 | No |
Recommended implementation order
Start with Contextual BM25 — add a Claude context-generation step before indexing, remove dense embeddings. Better quality, lower cost, less infrastructure. No changes to the MCP server, agent, or chatbot.
Add dense embeddings back when you observe users rephrasing the same question multiple ways and getting inconsistent results — the signal that BM25 recall is the bottleneck.
Add GraphRAG only after analysing real user queries for 4–6 weeks and confirming that multi-hop questions (history of a decision, ownership chains, impact of an incident) represent a significant share of traffic.
Key principle: the vector database is a scale concern, not a quality concern. Contextual Retrieval and GraphRAG address the actual quality bottlenecks. Fix quality first, then scale storage to match corpus size.
Why combine the two patterns?
What hybrid retrieval solves
Your company wiki uses different words than your users do. A user asks:
"How do we handle auth failures?" — the Confluence page is titled
"Authentication error handling". BM25 alone misses it (no word overlap).
Dense search alone misses it (rare product names get diluted).
The BM25 + dense + RRF + rerank pipeline from knowledge/hybrid-retrieval
catches both.
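The fusion step of that pipeline can be sketched in a few lines. This is a minimal illustration of reciprocal rank fusion, not the code in utils/retrieval.py; the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the chunk IDs are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk, so only rank positions
    matter and the raw BM25 / cosine scores never need to be normalised.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


# Hypothetical result lists from the two retrievers.
bm25_hits = ["auth-errors_c0", "jira-4521_c2", "vpn-setup_c1"]
dense_hits = ["login-faq_c0", "auth-errors_c0", "jira-4521_c2"]

fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)
```

A chunk that appears in both lists outranks one that is first in only a single list, which is why the page found by both retrievers ends up on top before the reranker sees it.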
What agentic retrieval solves
Some questions need more than one search:
"What changed in the deployment process last quarter and why?" requires
reading the current process page, finding the ADR that changed it, and
cross-referencing an incident report. A single-shot retrieval can't do that.
The agentic loop from knowledge/agentic-rag lets the model search
multiple times, follow leads, and synthesise across pages.
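The control flow of such a loop can be sketched without any LLM. In the sketch below the three callables (`search`, `is_sufficient`, `refine`) are hypothetical stand-ins for decisions the model makes in the real agent, and the toy corpus is invented for illustration.

```python
from typing import Callable


def agentic_search(
    question: str,
    search: Callable[[str], list],
    is_sufficient: Callable[[str, list], bool],
    refine: Callable[[str, list], str],
    max_steps: int = 3,
) -> list:
    """Search repeatedly, refining the query, until the evidence suffices."""
    evidence = []
    seen = set()
    query = question
    for _ in range(max_steps):
        for hit in search(query):
            if hit["chunk_id"] not in seen:  # de-duplicate across rounds
                seen.add(hit["chunk_id"])
                evidence.append(hit)
        if is_sufficient(question, evidence):
            break
        query = refine(question, evidence)  # follow a lead with a new query
    return evidence


# Toy stand-ins: a fixed lookup table instead of hybrid search, and
# hard-coded rules instead of model judgement.
CORPUS = {
    "deployment process changes": [{"chunk_id": "deploy_c0"}],
    "deployment ADR": [{"chunk_id": "adr-12_c0"}],
}

result = agentic_search(
    "deployment process changes",
    search=lambda q: CORPUS.get(q, []),
    is_sufficient=lambda q, ev: len(ev) >= 2,
    refine=lambda q, ev: "deployment ADR",
)
print([hit["chunk_id"] for hit in result])
```

The `max_steps` cap matters in practice: it bounds both latency and API cost when the model keeps reformulating a question that the index cannot answer.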
What MCP adds
MCP makes the whole thing a first-class tool inside any MCP client. Claude in your IDE or chat interface can search Confluence on its own when it needs internal context — without you having to copy-paste docs into the prompt manually.
Architecture
Confluence REST API
│
▼
1-fetch-confluence.py pull pages → strip HTML → chunk → chunks/*.json
│
▼
2-build-index.py BM25 index (bm25s) + dense embeddings (OpenAI)
│ │
└────────────┬───────────┘
▼
indexes/bm25/ indexes/embeddings.npy
indexes/meta.json
│
┌────────────────────┼───────────────────────┐
▼ ▼ ▼
3-hybrid-search.py 4-agent.py 5-mcp-server.py
(interactive CLI) (pydantic-ai agent) (FastMCP → Claude)
Retrieval pipeline (inside every search call)
Query
│
├─► BM25 catches exact terms, product names, ticket IDs
│ e.g. "JIRA-4521", "prod-db-01", "rerank-v4.0-fast"
│
├─► Dense catches paraphrase and synonyms
│ (cosine) e.g. "auth failure" ↔ "authentication error"
│
├─► RRF fuses the two ranked lists without score normalisation
│ (RRF sidesteps the problem that BM25 and cosine scores
│ live on completely different scales)
│
└─► Cohere rerank cross-encoder that sees query + document jointly
top-50 in much higher precision than bi-encoder dense alone
→ top-10 out
Agentic loop (inside 4-agent.py and every MCP session)
User question
│
▼
list_spaces (which Confluence domains are indexed?)
│
▼
hybrid_search ──────► BM25 + dense + RRF + rerank
│
▼
snippets relevant?
├─ Yes ──► synthesise answer
└─ No ──► get_page_full (read a full page)
or hybrid_search again with a refined query
│
▼
synthesise answer + citations
MCP integration
The MCP server in 5-mcp-server.py exposes three tools:
Tool | What it does |
| Returns indexed spaces (key + name) |
| Four-stage hybrid search; returns snippets |
| Returns full text of one page by page_id |
Claude acts as the agent — it calls these tools in a loop the same way
4-agent.py does internally. No duplicate agent layer on the server.
Connect Claude Desktop
{
"mcpServers": {
"confluence": {
"url": "http://localhost:8051/sse"
}
}
}
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Connect Claude Code
Add .mcp.json in your project root:
{
"mcpServers": {
"confluence": {
"type": "sse",
"url": "http://localhost:8051/sse"
}
}
}
Connect Cursor
Settings → MCP → Add server → URL: http://localhost:8051/sse
Setup
# 1. Install dependencies (from the repo root)
uv sync
# 2. Copy and fill in credentials
cp .env.example .env
# 3. Fetch Confluence pages (run with empty CONFLUENCE_SPACE_KEYS first
# to see available spaces, then set the ones you want and re-run)
uv run 1-fetch-confluence.py
# 4. Build BM25 + dense indexes (~$0.002 per 1 000 chunks)
uv run 2-build-index.py
# 5. Test retrieval interactively
uv run 3-hybrid-search.py
# 6. Try the full agent (optional)
uv run 4-agent.py "What is our deployment process?"
# 7. Start the MCP server for Claude Desktop / Cursor / Claude Code
uv run 5-mcp-server.py
You need four API keys in .env:
CONFLUENCE_BASE_URL + CONFLUENCE_EMAIL + CONFLUENCE_API_TOKEN
OPENAI_API_KEY: embeddings (text-embedding-3-small)
COHERE_API_KEY: reranker (rerank-v4.0-fast, free tier)
ANTHROPIC_API_KEY: only needed for 4-agent.py
Testing without a Confluence instance
1-fetch-k8s.py is a drop-in replacement for 1-fetch-confluence.py that
scrapes the public Kubernetes documentation (kubernetes.io/docs) instead of
your Confluence instance. It produces chunks in the exact same JSON format, so
every subsequent step (2-build-index.py, 3-hybrid-search.py,
5-mcp-server.py, 6-chatbot.py) runs unchanged. The scraper itself needs no
credentials; the indexing and search steps still use the OpenAI and Cohere
keys from .env.
Prerequisites
# beautifulsoup4 is the only extra dependency
uv pip install beautifulsoup4
Running the scraper
# Scrape ~190 Kubernetes docs pages (≈ 2 min at 0.5 s/request)
uv run 1-fetch-k8s.py
# Then continue with the normal pipeline
uv run 2-build-index.py
uv run 5-mcp-server.py # terminal 1
uv run streamlit run 6-chatbot.py # terminal 2
Expected output:
Fetching sitemap: https://kubernetes.io/en/sitemap.xml
Found 192 pages to index
[ 1/192] Concepts 3 chunk(s)
[ 2/192] Kubernetes Components 4 chunk(s)
...
Done: 189 pages → 847 chunks (2 empty, 1 errors)
Chunks saved to: .../chunks/
Configuration
All tunable constants are at the top of 1-fetch-k8s.py:
Constant | Default | What it controls |
|
| Sections of kubernetes.io/docs to crawl |
|
| Sub-paths excluded even if they fall under an included section |
|
| Pause between HTTP requests (respectful crawl rate) |
|
| Maximum characters per chunk |
|
| Overlap between consecutive chunks |
The large API reference (reference/kubernetes-api/, ~600 pages of spec
tables) is excluded by default to keep index size manageable. Add it to
INCLUDE_SECTIONS if you need API field lookups.
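The last two constants control character-based chunking with overlap. A minimal sketch of that technique follows; the function names and the default values (1 500 characters, 200 characters of overlap) are assumptions for illustration, not the values in 1-fetch-k8s.py. The `chunk_id` pattern mirrors the sample chunk shown under Output format.

```python
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def make_chunks(page_id: str, text: str, **kwargs) -> list:
    """Wrap each window in a dict carrying the ID fields used downstream."""
    return [
        {"chunk_id": f"{page_id}_c{i}", "page_id": page_id,
         "chunk_idx": i, "text": piece}
        for i, piece in enumerate(chunk_text(text, **kwargs))
    ]


demo = make_chunks("docs-concepts-workloads-pods", "0123456789",
                   max_chars=4, overlap=1)
print([c["text"] for c in demo])
```

A larger overlap improves recall for facts that straddle a boundary but inflates the index, since every overlapping character is indexed twice.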
Output format
Each chunk is saved as a JSON file in chunks/ and uses space_key: "K8S".
The schema is identical to Confluence chunks, so 3-hybrid-search.py and the
MCP server treat them the same way:
{
"chunk_id": "docs-concepts-workloads-pods_c0",
"page_id": "docs-concepts-workloads-pods",
"title": "Pods",
"space_key": "K8S",
"space_name": "Kubernetes Docs",
"url": "https://kubernetes.io/docs/concepts/workloads/pods/",
"text": "...",
"chunk_idx": 0,
"last_modified": "Mon, 10 Jun 2024 12:00:00 GMT"
}
Example queries once running
"What is the difference between a Deployment and a StatefulSet?"
"How do I configure resource limits for a Pod?"
"What happens when a node fails?"
"How does the Kubernetes scheduler decide where to place a Pod?"
Files
./
├── 1-fetch-confluence.py Production: fetch & chunk Confluence pages
├── 1-fetch-k8s.py Test/demo: scrape public Kubernetes docs
├── 2-build-index.py Build BM25 + dense indexes
├── 3-hybrid-search.py Interactive search CLI
├── 4-agent.py Pydantic-AI agent (standalone)
├── 5-mcp-server.py FastMCP server for Claude/Cursor/Claude Code
├── 6-chatbot.py Streamlit chatbot (MCP client + chat UI)
├── pyproject.toml All dependencies
├── .env.example
└── utils/
├── confluence.py ConfluenceClient + html_to_text + chunk()
├── retrieval.py HybridRetriever (BM25+dense+RRF+rerank)
└── agent_tools.py list_spaces / hybrid_search / get_page_full
shared by 4-agent.py and 5-mcp-server.py
Keeping the index fresh
Confluence changes. A simple refresh loop:
# Re-fetch changed pages and rebuild indexes (run nightly via cron)
uv run 1-fetch-confluence.py && uv run 2-build-index.py
For incremental updates, add last_modified filtering in iter_pages() to
skip pages not changed since the last run (compare against the timestamp
stored in indexes/meta.json).
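A sketch of that filter is below. It assumes `last_modified` uses the RFC 2822 date format seen in the sample chunk; timestamps returned by the Confluence REST API are typically ISO 8601 and would need `datetime.fromisoformat` instead. The helper name `changed_since` and the page dicts are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(pages, last_run):
    """Yield only the pages modified after the previous indexing run.

    `last_run` must be a timezone-aware datetime, e.g. loaded from the
    timestamp stored alongside the index metadata.
    """
    for page in pages:
        modified = parsedate_to_datetime(page["last_modified"])
        if modified > last_run:
            yield page


pages = [
    {"page_id": "a", "last_modified": "Mon, 10 Jun 2024 12:00:00 GMT"},
    {"page_id": "b", "last_modified": "Wed, 12 Jun 2024 08:30:00 GMT"},
]
last_run = datetime(2024, 6, 11, tzinfo=timezone.utc)

to_reindex = [p["page_id"] for p in changed_since(pages, last_run)]
print(to_reindex)
```

Record the new run timestamp only after the index rebuild succeeds, so a failed run is retried rather than silently skipped.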
Extending
Want to add | Where to change |
Filter by Confluence label/ancestor |
|
Parent/child page navigation tool | Add |
Local reranker (offline) | Swap Cohere in |
Vector DB instead of numpy | Replace |
Incremental index updates | Track |
Evaluate retrieval quality | Use |
Production architecture (many GB of Confluence data)
The prototype works well for hundreds of pages but hits hard limits at scale. This section explains what breaks and how to redesign each layer for production.
Why the prototype does not scale
Component | Prototype | Breaks when… |
Dense index |
| >500 k chunks (~3 GB at 1536 dims, fp32) |
BM25 index |
| Corpus grows beyond a few hundred MB of text |
Sync | Full re-fetch + full re-index | Pages number in the tens of thousands |
MCP server | Single-process, no auth | Multiple concurrent users, internal tool exposure |
Access control | None — every user sees every page | Any team with Confluence page restrictions |
Chunking | Character-based with fixed overlap | Long structured pages (tables, code blocks split badly) |
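The dense-index limit in the first row is plain arithmetic, shown here as a small helper (the function name is illustrative):

```python
def dense_index_bytes(n_chunks: int, dims: int = 1536,
                      bytes_per_value: int = 4) -> int:
    """Memory needed to hold every embedding in one fp32 numpy array."""
    return n_chunks * dims * bytes_per_value


size_gb = dense_index_bytes(500_000) / 1e9
print(f"{size_gb:.2f} GB")  # 3.07 GB
```

Because the prototype loads the whole array into RAM and scans it for every query, both memory use and query latency grow linearly with the number of chunks.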
Recommended production stack
Vector database
The right choice depends on where you run infrastructure. The short version: Qdrant if you want the best hybrid search engine; Amazon OpenSearch Service if you are already on AWS and want to stay AWS-native; Bedrock Knowledge Bases if you want zero ingestion pipeline work.
Option | Cloud | Choose if… | Trade-off |
Qdrant Cloud | Any (hosted on AWS/GCP/Azure infra) | Starting fresh, want managed ops | Easiest setup; data in Qdrant's account |
Qdrant on EKS/ECS | AWS | Data must stay in your AWS account; already run containers | You manage upgrades and backups |
Qdrant self-hosted | On-prem / any VM | Full control, air-gapped environments | Full operational burden |
Amazon OpenSearch Service | AWS | Already on AWS, want IAM + CloudWatch native integration | Slightly slower ANN than Qdrant at extreme scale |
Amazon Bedrock Knowledge Bases | AWS | Want zero ingestion pipeline; Confluence connector built-in | Less control over chunking, reranking, and MCP integration |
Azure AI Search | Azure | Already on Azure | Native Confluence connector; higher cost per query |
pgvector (RDS/Aurora) | Any | Small corpus (<1 M chunks), already run PostgreSQL | Simplest ops; ANN slows above ~5 M vectors |
Pinecone | Any | — | No native sparse support in standard tier; vendor lock-in |
Qdrant is the strongest general-purpose choice because it offers native hybrid search (dense + sparse vectors in a single query with built-in RRF) without requiring application-level result merging. OpenSearch also supports single-query hybrid search, as described under Path 2. For AWS-specific guidance see the Running on AWS section below.
Sparse vectors — BM42 instead of BM25
In production, replace the bm25s BM25 index with BM42 sparse vectors stored
inside Qdrant. BM42 is a sparse embedding model (available through Qdrant's
FastEmbed library) that combines IDF with transformer attention weights to
produce sparse vectors compatible with Qdrant's sparse index. It is aimed at
short chunks, where classical BM25 term-frequency statistics are weak, and it
preserves the exact-term matching that dense embeddings miss. Treat any
quality gain over BM25 as something to verify on your own eval set.
The result is a single Qdrant collection with both a dense vector field and a
sparse vector field, queried together in one round-trip.
Embedding model
Keep text-embedding-3-small as the default (best cost/quality ratio for this
use case). Switch to text-embedding-3-large only if your Confluence content is
heavily multilingual or filled with specialised technical jargon where the extra
embedding capacity measurably improves NDCG on your eval set.
Reranker
Keep Cohere rerank-v4.0-fast for managed convenience. Switch to
BAAI/bge-reranker-v2-m3 (self-hosted, ~568 M parameters) if your data
governance policy prohibits sending document snippets to an external API.
Chunking
Replace the character-based chunking in utils/confluence.py with
semantic chunking using Docling. Docling understands Confluence's HTML
structure — it splits at heading boundaries, keeps table rows together, and
does not cut mid-sentence. Pair this with a parent-child chunking strategy:
embed small chunks (~256 tokens) for high-precision retrieval, but return the
parent section (~1 024 tokens) as context to the agent. This gives precise
matching without truncating the evidence the model needs to answer well.
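The parent-child strategy can be sketched as two small functions: one splits each parent section into small children for indexing, the other maps ranked child hits back to their parents at query time. This is an illustrative sketch only; it uses whitespace-separated words as a crude proxy for tokens, and the names `split_children` and `expand_to_parents` are hypothetical.

```python
def split_children(parent_id: str, text: str, child_words: int = 256) -> list:
    """Split one parent section into small child chunks for retrieval."""
    words = text.split()
    return [
        {"child_id": f"{parent_id}#{i}", "parent_id": parent_id,
         "text": " ".join(words[start:start + child_words])}
        for i, start in enumerate(range(0, len(words), child_words))
    ]


def expand_to_parents(ranked_children: list, parents: dict) -> list:
    """Map ranked child hits to parent sections, keeping first-seen order."""
    seen = set()
    context = []
    for child in ranked_children:
        pid = child["parent_id"]
        if pid not in seen:  # several children may share one parent
            seen.add(pid)
            context.append(parents[pid])
    return context


parents = {"s1": "a b c d e", "s2": "f g h"}
children_s1 = split_children("s1", parents["s1"], child_words=2)
children_s2 = split_children("s2", parents["s2"], child_words=2)

# Pretend the retriever ranked these three children highest.
hits = [children_s1[1], children_s2[0], children_s1[2]]
print(expand_to_parents(hits, parents))
```

De-duplicating on `parent_id` keeps the agent's context short when several high-scoring children come from the same section.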
Running on AWS
If your company operates on AWS, each layer of the stack maps to a managed AWS service. You have three realistic paths depending on how much control you want over the retrieval pipeline.
Path 1 — Qdrant on AWS (best retrieval quality, your infra)
Run Qdrant inside your own AWS account so data never leaves your perimeter, while keeping Qdrant's native hybrid search.
Qdrant deployment | When to choose |
Qdrant Cloud on AWS | Fastest start; Qdrant manages ops; pick the AWS region closest to your app. Data lives in Qdrant's AWS account — check your data residency policy first. |
Qdrant on EKS | Already run Kubernetes; use the official Qdrant Helm chart; EBS volumes for persistent storage; IAM for pod-level auth. |
Qdrant on ECS Fargate | No Kubernetes; run the official Docker image as a Fargate service; EFS mount for persistence; simpler ops than EKS. |
Path 2 — Amazon OpenSearch Service (AWS-native, recommended for most AWS teams)
This is the pragmatic default for AWS. OpenSearch Service is fully managed,
stays inside your AWS account, and has supported hybrid search (BM25 + k-NN
vector in a single query) since version 2.10. It replaces Qdrant without any
change to the MCP server or agent — only utils/retrieval.py changes.
Why choose OpenSearch Service over Qdrant on AWS:
IAM authentication — no separate credentials; attach an IAM role to your workers and MCP server, and OpenSearch accepts them natively.
VPC isolation — deploy into a private VPC subnet; no public endpoint needed.
CloudWatch integration — cluster metrics, slow query logs, and index statistics flow to CloudWatch out of the box.
One fewer vendor — no Qdrant Cloud account, no Qdrant billing, no separate support contract. Everything is under your existing AWS bill.
Familiar to AWS ops teams — most AWS platform teams already know how to run OpenSearch.
The trade-off: OpenSearch is slower per node than Qdrant at very high query volume (Qdrant is Rust, OpenSearch is JVM). For a company-internal Confluence search workload you will not reach that ceiling.
Path 3 — Amazon Bedrock Knowledge Bases (fully managed, zero pipeline)
If engineering time is the bottleneck, Bedrock Knowledge Bases can eliminate most of the ingestion pipeline. It has a native Confluence data source connector — you supply OAuth credentials and a space list, and Bedrock handles fetching, chunking, embedding (Amazon Titan or third-party models), and indexing into either OpenSearch Serverless or Aurora pgvector.
What you keep: the MCP server and the Streamlit chatbot.
What you replace: 1-fetch-confluence.py, 2-build-index.py,
utils/retrieval.py, and utils/confluence.py.
The search_confluence tool in the MCP server calls the Bedrock
Retrieve API instead of Qdrant directly.
Trade-offs:
Less control over chunk size, overlap, and chunking strategy.
Reranking must go through Bedrock's own reranker; Cohere is not directly pluggable.
Bedrock Agents (not your MCP server) handles the agentic loop if you use RetrieveAndGenerate. If you want to keep the MCP pattern, call only the Retrieve API and drive the loop from your FastMCP server as today.
Cold-start latency on OpenSearch Serverless can be high for infrequent queries.
AWS services mapping
Role | AWS service |
Confluence change events | Confluence webhooks → API Gateway → SQS |
Ingestion workers | ECS Fargate (auto-scaling) or EKS |
Vector store (AWS-native) | Amazon OpenSearch Service (hybrid BM25 + k-NN) |
Vector store (Qdrant path) | Qdrant on EKS or ECS Fargate |
Fully managed RAG | Amazon Bedrock Knowledge Bases + Confluence connector |
Query result cache | ElastiCache for Redis (TTL 15 min) |
MCP server hosting | ECS Fargate behind an ALB |
TLS termination | ALB (ACM certificate, no Nginx needed) |
DDoS + WAF | AWS WAF on the ALB |
API key / secret storage | AWS Secrets Manager (rotate without redeploy) |
Container images | ECR (Elastic Container Registry) |
Monitoring + alerts | CloudWatch metrics, alarms, and dashboards |
Embedding API | OpenAI (via internet) or Amazon Bedrock hosted models |
Reranker API | Cohere (via internet) or Bedrock reranker |
AWS architecture diagram
Confluence Cloud
│
│ Webhooks (page_created / updated / deleted)
▼
┌──────────────┐ ┌─────────────────────────────────────────┐
│ API Gateway │───►│ Amazon SQS │
│ (webhook │ │ — buffers change events, decouples │
│ endpoint) │ │ Confluence rate limits from workers │
└──────────────┘ └──────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────┐
│ ECS Fargate — Ingestion │
│ Workers (auto-scaling) │
│ │
│ 1. Fetch page (Confluence │
│ REST API) │
│ 2. Chunk (Docling) │
│ 3. Embed (OpenAI / Bedrock) │
│ 4. Upsert to vector store │
└──────────────┬──────────────┘
│
┌──────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌─────────────┐ ┌────────────────────┐
│ Amazon OpenSearch │ │ Qdrant on │ │ Bedrock Knowledge │
│ Service │ │ EKS / ECS │ │ Bases │
│ (BM25 + k-NN │ │ (dense + │ │ (managed, native │
│ hybrid, IAM) │ │ sparse) │ │ Confluence sync) │
└────────┬─────────┘ └──────┬──────┘ └─────────┬──────────┘
└──────────────────┬┘ │
│ │
▼ │
┌────────────────────┐ │
│ ECS Fargate │ │
│ MCP Server │◄────────────┘
│ (FastMCP) │
│ + Secrets Manager │ ┌──────────────────┐
│ for API keys │◄►│ ElastiCache Redis │
└────────┬────────────┘ │ query cache │
│ │ TTL 15 min │
│ └──────────────────┘
▼
┌────────────────────────┐
│ ALB (HTTPS, ACM cert) │
│ + AWS WAF │
└────────────┬───────────┘
│ HTTPS MCP (SSE)
▼
Claude Desktop / Cursor / Claude Code
Streamlit chatbot (ECS Fargate or EC2)
AWS decision guide
Your situation | Recommended path |
Starting from scratch on AWS, want best retrieval quality | Qdrant Cloud on AWS (fastest) → migrate to EKS when you need data residency |
Data must stay in your AWS account, run Kubernetes | Qdrant on EKS |
Data must stay in your AWS account, no Kubernetes | Amazon OpenSearch Service |
AWS platform team, want IAM + CloudWatch native | Amazon OpenSearch Service |
Engineering time is the bottleneck, want it working this week | Amazon Bedrock Knowledge Bases |
Already run pgvector / RDS Aurora, corpus < 1 M chunks | pgvector — no new service needed |
Production architecture diagram (generic)
Confluence Cloud
│
│ Webhooks (page_created / page_updated / page_deleted)
│ or scheduled polling every 15 min via REST API (lastModified filter)
▼
┌─────────────────────┐
│ Message Queue │ Redis Streams / AWS SQS / RabbitMQ
│ change events │ — decouples ingestion speed from Confluence rate limits
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Ingestion Worker │ One or more worker processes (Celery / ARQ / plain threads)
│ │ 1. Fetch changed page from Confluence REST API
│ fetch → chunk │ 2. Semantic chunk with Docling
│ embed → upsert │ 3. Embed with text-embedding-3-small (batch)
│ │ 4. Upsert dense + sparse vectors into Qdrant
└──────────┬──────────┘ 5. Delete Qdrant points for removed pages
│
▼
┌─────────────────────────────────────────────────────┐
│ Qdrant Collection │
│ │
│ point_id: chunk_id │
│ dense vector: text-embedding-3-small (1 536 dims) │
│ sparse vector: BM42 │
│ payload: { page_id, title, space_key, url, │
│ last_modified, labels[], ancestor_ids[] }│
│ │
│ Payload indexes on: space_key, labels, last_modified│
└──────────┬──────────────────────────────────────────┘
│
│ Hybrid query (dense + sparse + RRF) with payload filter
│ → top-50 candidates
│ → Cohere rerank → top-10
▼
┌─────────────────────┐ ┌───────────────────────┐
│ MCP Server │ │ Redis Cache │
│ (FastMCP) │◄──────►│ query → results │
│ │ │ TTL: 15 min │
│ + API key auth │ └───────────────────────┘
│ + rate limiting │
│ + request logging │
└──────────┬──────────┘
│ MCP protocol (SSE or Streamable HTTP)
▼
Claude Desktop / Cursor / Claude Code / Streamlit chatbot
Access control
This is the most important production concern that the prototype ignores entirely. Confluence has page-level and space-level permissions. Without access control, any user of the chatbot can retrieve content from restricted pages they would not normally be allowed to read.
There are three approaches, ordered by accuracy vs. implementation cost:
Option A — Space-level filtering (simplest, coarse-grained)
Index only pages from spaces the chatbot is authorised to see. At query time, filter Qdrant results to the spaces the requesting user's Confluence group can access. Works well when spaces map cleanly to team boundaries. Leaks nothing as long as restricted content is in its own space.
Option B — Permission index (recommended for most companies)
When ingesting each page, call the Confluence REST API
(GET /wiki/rest/api/content/{id}/restriction) to fetch which users and groups
can read it. Store those groups in the Qdrant payload alongside each chunk.
At query time, resolve the requesting user's group membership (via Confluence
or your IdP) and add a Qdrant payload filter groups IN [user_groups].
The filter runs inside Qdrant before scoring — no restricted results are ever
returned. Rebuild this permission payload whenever Confluence restriction changes
arrive via webhook.
Option C — Per-request Confluence API check (most accurate, slowest)
After Qdrant returns candidates, call the Confluence REST API to verify the requesting user can read each candidate page, and filter out the ones they cannot. Accurate because it uses Confluence's own permission system as the source of truth, but adds a round-trip per result set and becomes a bottleneck at high query volume.
For most enterprise deployments, Option B is the right balance.
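The Option B check reduces to a set intersection between the groups stored on each chunk and the groups of the requesting user. The sketch below shows that logic in plain Python; in production the same condition would run inside the vector store as a payload filter. It assumes every chunk carries a non-empty `groups` list, meaning that unrestricted pages were given their space's reader groups at ingestion time; that convention and the function name are assumptions of this example.

```python
def visible_chunks(candidates: list, user_groups) -> list:
    """Keep only chunks whose stored reader groups include one of the user's."""
    allowed = set(user_groups)
    return [c for c in candidates if allowed & set(c["groups"])]


candidates = [
    {"chunk_id": "runbook_c0", "groups": ["engineering"]},
    {"chunk_id": "salaries_c3", "groups": ["hr"]},
    {"chunk_id": "onboarding_c1", "groups": ["engineering", "hr"]},
]

visible = visible_chunks(candidates, user_groups=["engineering"])
print([c["chunk_id"] for c in visible])
```

A chunk with an empty `groups` list is hidden from everyone under this rule, which fails closed: a missing permission payload can never leak a restricted page.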
Sync strategy
Scenario | Approach |
Initial load | Bulk-fetch all spaces in parallel workers; embed in batches of 100; upsert to Qdrant |
Ongoing updates | Confluence webhooks → queue → worker upserts only changed chunks |
No webhook access | Scheduled polling every 15 min using |
Page deleted | Webhook |
Space deleted | Delete all Qdrant points where |
AWS cost estimates
Prices below are approximate, based on us-east-1 on-demand rates as of mid-2026. Verify current prices with the AWS Pricing Calculator before budgeting. All figures are per month.
Tier assumptions
Assumption | Small | Medium | Large |
Confluence pages indexed | 10 000 | 100 000 | 500 000 |
Chunks in index | ~50 000 | ~500 000 | ~2 500 000 |
Active users | 50 | 500 | 2 000 |
Queries per day | 100 | 1 000 | 5 000 |
Queries per month | 3 000 | 30 000 | 150 000 |
Monthly cost breakdown — OpenSearch Service path
Component | Small | Medium | Large |
Amazon OpenSearch Service | |||
Instance (t3.small × 2 nodes HA) | $52 | — | — |
Instance (m6g.large × 3 nodes HA) | — | $320 | — |
Instance (m6g.2xlarge × 3 nodes HA) | — | — | $1 280 |
EBS storage (gp3) | $3 | $27 | $135 |
ECS Fargate — MCP server | |||
0.5 vCPU / 1 GB (1 instance) | $18 | — | — |
0.5 vCPU / 1 GB (2 instances) | — | $36 | — |
0.5 vCPU / 1 GB (4 instances) | — | — | $72 |
ECS Fargate — ingestion workers | $5 | $15 | $40 |
ElastiCache Redis | |||
cache.t3.micro | $12 | — | — |
cache.t3.small × 2 (multi-AZ) | — | $49 | — |
cache.r6g.large × 2 (multi-AZ) | — | — | $240 |
ALB + ACM certificate | $20 | $25 | $35 |
AWS WAF | — | $10 | $20 |
SQS + Secrets Manager + CloudWatch | $8 | $12 | $20 |
OpenAI text-embedding-3-small | <$1 | $1 | $5 |
Cohere rerank-v4.0-fast | $6 | $60 | $300 |
Anthropic Claude claude-sonnet-4-6 | $78 | $780 | $3 900 |
Estimated total | ~$203 | ~$1 335 | ~$6 047 |
Anthropic cost breakdown per query
Each user question triggers an agentic loop with 2–3 Claude API calls (list_spaces → search → optionally get_page). A realistic average:
Token type | Tokens per query | Cost (Sonnet 4.6) |
Input (system prompt + tool results) | ~5 500 | $0.0165 |
Output (tool selections + final answer) | ~600 | $0.0090 |
Total per query | ~$0.026 |
At 1 000 queries/day × 30 days = ~$780/month in Claude API costs alone. This is the dominant cost at every scale — infrastructure is secondary.
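The per-query figure can be reproduced from the token counts in the table. The per-million-token prices below ($3 input, $15 output) are the rates implied by that table; the function name is illustrative.

```python
INPUT_USD_PER_MTOK = 3.00    # implied by $0.0165 for ~5 500 input tokens
OUTPUT_USD_PER_MTOK = 15.00  # implied by $0.0090 for ~600 output tokens


def cost_per_query(input_tokens: int = 5_500, output_tokens: int = 600) -> float:
    """Claude API cost in USD for one user question."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


per_query = cost_per_query()
monthly = per_query * 1_000 * 30  # medium tier: 1 000 queries/day

print(f"per query: ${per_query:.4f}")
print(f"per month: ${monthly:.0f}")
```

The unrounded result is about $0.0255 per query and about $765 per month; the ~$780 figure in the text comes from rounding the per-query cost up to $0.026 before multiplying.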
Key observations
The LLM API bill dominates. At medium scale Claude accounts for ~58% of total spend; at large scale ~64%. Optimise here first before touching infrastructure.
Infrastructure costs are reasonable. Even at large scale (3 OpenSearch nodes, 4 MCP server tasks, Redis cluster) the AWS bill excluding API costs is ~$1 800/month, roughly one mid-level engineer's monthly salary. The ROI calculation is almost always positive.
OpenSearch storage is cheap. 2.5 M chunks × 300 tokens × ~4 bytes ≈ 3 GB of raw text. With OpenSearch overhead (inverted index + k-NN graph) plan for ~10× = ~30 GB, which costs ~$4/month. Even the more generous $135/month budgeted for the large tier in the table above is about 2% of total spend. Storage is never the problem.
Cost optimisation levers
These are ordered by impact:
1. Anthropic prompt caching — saves 20–30% on Claude costs
Enable cache_control on the system prompt and the tool definitions block.
Cached input tokens are billed at $0.30/MTok (90% discount vs $3/MTok).
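The system prompt (~500 tokens) and tool list (~300 tokens) are identical
on every call, so they are natural cache candidates. Check that the cached
prefix meets the model's minimum cacheable length; shorter prefixes are
processed without caching.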
2. Route simple queries to Claude Haiku — saves up to 60% on Claude costs
Haiku 4.5 costs $1/MTok in and $5/MTok out, roughly 3× cheaper than Sonnet.
Add a classifier that sends single-fact lookups ("What is a ConfigMap?") to
Haiku and only escalates multi-hop questions ("What changed in our deployment
process and why?") to Sonnet. If 60% of queries qualify, medium-scale Claude
spend drops from ~$780 to ~$470/month (0.4 × $780 on Sonnet plus
0.6 × $780 / 3 on Haiku).
3. Redis query caching — saves 20–40% on Claude costs for repeated questions
Teams ask the same questions. "How do I request VPN access?" is asked by every new joiner. A Redis cache keyed on a normalised query hash with a 15-minute TTL (already in the architecture diagram) eliminates the Claude round-trip entirely for cache hits. Common internal Q&A workloads see 25–35% cache hit rates.
4. Reserved instances for OpenSearch — saves 30–40% on compute
A 1-year Reserved Instance for m6g.large.search drops from $0.148/hr to
~$0.088/hr. On three nodes that saves ~$130/month (from ~$320 to ~$190 per
3-node cluster). Commit only after validating instance size in production.
5. Fargate Spot for ingestion workers — saves ~70% on worker compute
Ingestion workers are interruptible — if a Spot interruption occurs, SQS
re-delivers the message and the worker retries. Switch the ECS task definition
to use FARGATE_SPOT capacity provider. At medium scale this saves ~$10/month;
at large scale ~$28/month.
6. Cohere free tier for small teams
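The Cohere trial tier gives 1 000 free rerank calls/month. A team of 50 users
making 3 searches/day averages ~4 500 calls/month, about 4.5× the free limit,
so the free tier covers evaluation and very small teams only. Reducing top_k
from 50 to 25 trims reranker latency at a small quality cost but does not
reduce call volume, because each search is one rerank call whatever the
candidate count; caching repeated queries does.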
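Lever 3 (query caching) depends on two pieces: a normalised cache key, so trivially different phrasings hit the same entry, and a TTL. The sketch below uses an in-memory dictionary as a stand-in for Redis so that it runs without a server; the class and function names are hypothetical, and the normalisation rule (lower-case, strip punctuation, collapse whitespace) is one reasonable choice among several.

```python
import hashlib
import re
import time


def cache_key(query: str) -> str:
    """Hash a normalised form of the query."""
    normalised = re.sub(r"[^\w\s]", "", query.lower())
    normalised = " ".join(normalised.split())
    return "q:" + hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis set-with-expiry / get pair."""

    def __init__(self, ttl_seconds: float = 15 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:  # entry is stale: drop it
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


cache = TTLCache()
cache.set(cache_key("How do I request VPN access?"), ["vpn-setup_c1"])
print(cache.get(cache_key("  how do I request vpn access ")))
```

If access control is enabled, the key should also include the user's groups, otherwise a cached result computed for one user could be served to another with fewer permissions.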
Realistic optimised costs
Applying caching, Haiku routing (60% of queries), and reserved instances:
Estimate | Small | Medium | Large |
Baseline estimate | ~$203 | ~$1 335 | ~$6 047 |
After optimisations | ~$110 | ~$620 | ~$2 900 |
Saving | ~46% | ~54% | ~52% |
Amazon Bedrock Knowledge Bases — cost comparison
The fully managed path trades control for simplicity but is not always cheaper:
Component | Monthly cost |
Bedrock Titan Embeddings ($0.0001/1K tokens) | ~$0.15 (initial), <$1 ongoing |
OpenSearch Serverless (minimum 4 OCUs) | ~$700 |
Bedrock | ~$0.10 per 1 000 calls |
Anthropic Claude (same as above) | same |
The OpenSearch Serverless minimum of 4 OCUs (~$700/month) makes Bedrock Knowledge Bases more expensive than self-managed OpenSearch at small and medium scale. It becomes cost-competitive only above ~2 M chunks where you need multiple OpenSearch data nodes anyway. The main argument for Bedrock Knowledge Bases is not cost — it is engineering time saved on the ingestion pipeline.
Disclaimer: All figures are estimates based on public AWS and API pricing as of mid-2026. Actual costs depend on your specific usage patterns, AWS region, negotiated enterprise pricing, and data transfer costs. Use the AWS Pricing Calculator for precise projections before committing to an architecture.
Beyond vector databases — better retrieval approaches
Swapping one vector database for another improves scale and operational robustness but does almost nothing for retrieval quality. The two approaches below address the actual quality bottlenecks for a Confluence-sized corpus.
Approach 1 — Contextual Retrieval
What it is: Before indexing each chunk, ask Claude to prepend a short context paragraph describing where the chunk sits in the document, what the page is about, and why this section matters. Then index with BM25 only — no dense embeddings at all — and apply the existing Cohere reranker on top.
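Anthropic published benchmarks in late 2024 reporting 49% fewer top-20 retrieval failures when contextual embeddings are combined with contextual BM25, and 67% fewer with a reranker on top. Those figures were measured with dense embeddings still in the pipeline, so treat the BM25-only variant described here as a cost and simplicity trade-off, and confirm on your own eval set that contextual BM25 + rerank matches your current BM25 + dense baseline before removing embeddings.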
Why Confluence chunks need this: When you strip HTML from a Confluence page and split it into 1 500-character chunks, the chunks lose their surrounding context. A chunk that says "set the flag to true to enable this feature" scores well for the query "how do I enable features" but is useless without knowing which page it came from and which feature it refers to. The prepended context sentence — "This chunk is from the Engineering Handbook, section on Feature Flags, describing how to enable a new flag in the production config service" — makes the chunk self-contained and dramatically more retrievable.
What changes in the pipeline:
Step | Without contextual retrieval | With contextual retrieval |
Indexing | chunk → BM25 + embed → store | chunk → Claude context → contextual chunk → BM25 → store |
Dense embeddings | Required | Removed entirely |
Vector database | Required | Not needed |
BM25 index | Required | Required (same) |
Reranker | Required | Required (same) |
MCP server | Unchanged | Unchanged |
Agent | Unchanged | Unchanged |
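The indexing row of the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `Chunk`, `template_context`, and `contextualize` are hypothetical names, and the template function stands in for the Claude call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    space: str      # Confluence space name, e.g. "Engineering Handbook"
    section: str    # nearest heading above the chunk
    text: str       # raw chunk text after HTML stripping


def template_context(chunk: Chunk) -> str:
    """Stand-in for the LLM call: builds the context line from metadata only."""
    return f"This chunk is from the {chunk.space}, section on {chunk.section}."


def contextualize(chunk: Chunk, generate_context: Callable[[Chunk], str]) -> str:
    """Prepend the generated context so the indexed text is self-contained."""
    context = generate_context(chunk).strip()
    return f"{context}\n\n{chunk.text}"


chunk = Chunk(
    space="Engineering Handbook",
    section="Feature Flags",
    text="Set the flag to true to enable this feature.",
)
indexed_text = contextualize(chunk, template_context)
print(indexed_text)
```

In a real pipeline `generate_context` would wrap the model call, and `indexed_text` (not the raw chunk) is what gets written to the BM25 index. The query "feature flags" now matches on terms the raw chunk never contained.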
One-time indexing cost (Claude Haiku at $0.80/MTok):
Corpus size | Chunks | Context tokens | Indexing cost |
10 000 pages | 50 000 | ~10 M tokens | ~$8 |
100 000 pages | 500 000 | ~100 M tokens | ~$80 |
500 000 pages | 2 500 000 | ~500 M tokens | ~$400 |
This is a one-time cost per full re-index, not a recurring monthly expense. Delta syncs (only changed pages) are proportionally cheaper.
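The table rows follow from simple arithmetic, sketched below. The defaults of 5 chunks per page and 200 context tokens per chunk are not stated in the text; they are inferred from the table (50 000 chunks for 10 000 pages, ~10 M tokens for 50 000 chunks). The function name is hypothetical, and the sketch counts only the tokens listed in the table at the quoted rate.

```python
def context_indexing_cost(
    pages: int,
    chunks_per_page: int = 5,              # inferred from the table
    context_tokens_per_chunk: int = 200,   # inferred from the table
    usd_per_million_tokens: float = 0.80,  # Claude Haiku rate quoted above
) -> float:
    """Estimated one-time cost in USD of generating context for every chunk."""
    total_tokens = pages * chunks_per_page * context_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


for pages in (10_000, 100_000, 500_000):
    print(f"{pages:>7} pages -> ~${context_indexing_cost(pages):,.0f}")
```

For a delta sync, pass the number of changed pages instead of the full corpus size.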
Running cost impact: Removing dense embeddings eliminates the OpenAI embedding API call on every query (~$0.02/1M tokens, small but real), removes the vector index from OpenSearch or Qdrant (reducing storage by 30–50%), and simplifies the retrieval code to a single BM25 query path.
AWS implementation: Keep OpenSearch Service for BM25. Remove the k-NN plugin configuration and dense vector field entirely. Add a pre-indexing ECS task that calls the Anthropic API to generate context for each chunk before the ingestion worker writes to OpenSearch.
Approach 2 — GraphRAG
What it is: Instead of treating Confluence as a flat collection of text chunks, build a knowledge graph from it — extracting entities, relationships, and summaries — and answer questions by traversing the graph rather than scoring chunks by similarity.
Microsoft Research published GraphRAG in 2024 and showed 20–70% improvement over naive RAG on complex multi-hop questions depending on question type. The gains are largest on exactly the investigative questions that Confluence is used for.
Why Confluence is a natural graph: Confluence already has rich structure that flat retrieval throws away:
Space
└── Section page
    ├── Child page A ──links to──► ADR-042
    │   └── Child page A1            │
    └── Child page B                 └──triggered by──► Incident-2024-Q3
        (author: team-platform)                         (owner: team-sre)

When a user asks "What changed in our deployment process last quarter and why?", the answer requires:
Find the current deployment process page
Follow links to the ADR that modified it
Find the incident report that triggered the ADR
Synthesise the chain of causality across three pages
A flat vector search returns the chunks with the highest similarity score. A graph traversal follows the actual structure of the knowledge.
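The four steps can be sketched as a walk over typed edges. The edge data mirrors the diagram above; the dictionary layout and the `follow_chain` helper are assumptions made for illustration, not the project's data model.

```python
# Typed, directed edges: (source, relation) -> target.
# Hypothetical data mirroring the diagram above.
EDGES = {
    ("Child page A", "links_to"): "ADR-042",
    ("ADR-042", "triggered_by"): "Incident-2024-Q3",
}


def follow_chain(start: str, relations: list[str]) -> list[str]:
    """Walk the graph along the given relation types, one hop per relation."""
    path = [start]
    node = start
    for relation in relations:
        nxt = EDGES.get((node, relation))
        if nxt is None:
            break  # chain ends early: no such edge from this node
        path.append(nxt)
        node = nxt
    return path


chain = follow_chain("Child page A", ["links_to", "triggered_by"])
print(" -> ".join(chain))
```

The returned path is the list of pages to fetch and hand to the model for synthesis. A similarity search has no equivalent of the `triggered_by` hop, because the incident report may share little vocabulary with the process page.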
How GraphRAG works for Confluence:
Ingestion
│
├─► Extract entities from each page
│   (system names, team names, process names, ticket IDs, dates)
│
├─► Extract relationships between entities
│   (page A links to page B, process X was changed by ADR Y,
│   incident Z triggered decision W)
│
├─► Build community summaries
│   (cluster related pages into topics, summarise each cluster)
│
└─► Store in graph database (Neptune) + keep BM25 for keyword search

Query
│
├─► Global questions ("what are our main deployment processes?")
│   → community summary traversal, no chunk retrieval needed
│
└─► Local questions ("how do I deploy service X?")
    → entity lookup → graph hop → retrieve relevant pages → answer

Two query modes:
Mode | Best for | How it works |
Local search | Specific factual questions | Find entity in graph → traverse 1–2 hops → retrieve source pages → answer |
Global search | Broad thematic questions | Query community summaries → synthesise across the whole corpus |
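Local search amounts to a hop-limited breadth-first search from the matched entity. The sketch below uses an in-memory adjacency list as a stand-in for the graph database; the graph contents and the `local_search` name are hypothetical. Global search is not covered here.

```python
from collections import deque

# Hypothetical undirected adjacency list: node -> related entities or pages.
GRAPH = {
    "service-x": ["Deploy runbook", "team-platform"],
    "Deploy runbook": ["service-x", "ADR-042"],
    "team-platform": ["service-x"],
    "ADR-042": ["Deploy runbook", "Incident-2024-Q3"],
    "Incident-2024-Q3": ["ADR-042"],
}


def local_search(entity: str, max_hops: int = 2) -> dict[str, int]:
    """Breadth-first search returning each reachable node and its hop distance."""
    if entity not in GRAPH:
        return {}
    distance = {entity: 0}
    queue = deque([entity])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in GRAPH.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


print(local_search("service-x", max_hops=2))
```

The hop limit keeps the retrieved set small: with `max_hops=2` the incident report three hops away is excluded, which matches the 1–2 hop range in the table.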
AWS implementation: Amazon Neptune (fully managed graph database) for the knowledge graph. Neptune Analytics (in-memory graph engine, announced 2023) for fast traversal queries. The ingestion worker gains a graph extraction step that calls Claude to identify entities and relationships per page, then writes edges to Neptune alongside the existing OpenSearch BM25 upsert.
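Not every edge needs an LLM call: the "page A links to page B" relationship can be read directly from the page markup. The sketch below assumes links appear as plain `<a href>` anchors, as in rendered HTML; other export formats would need a different parser. `LinkExtractor` and `link_edges` are hypothetical names, and writing the triples to the graph store is left out.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags found in a page's HTML body."""

    def __init__(self) -> None:
        super().__init__()
        self.targets: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.targets.append(href)


def link_edges(page_id: str, body: str) -> list[tuple[str, str, str]]:
    """Return (source, relation, target) triples for every link on the page."""
    parser = LinkExtractor()
    parser.feed(body)
    parser.close()
    return [(page_id, "links_to", target) for target in parser.targets]


body = '<p>See <a href="/pages/ADR-042">ADR-042</a> for the rationale.</p>'
print(link_edges("deployment-process", body))
```

Semantic relationships such as "incident Z triggered decision W" are not recoverable this way and remain the job of the model-based extraction step.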
AWS architecture addition for GraphRAG:
Ingestion worker
│
├─► (existing) chunk → BM25 upsert → OpenSearch
│
└─► (new) page full text → Claude entity extraction
    → Neptune upsert (nodes + edges)
    → Neptune Analytics community clustering (nightly)

Query (MCP server)
│
├─► hybrid_search (existing BM25 + rerank path)
│
└─► graph_search (new tool)
    → Neptune Analytics traversal
    → fetch source pages
    → synthesise answer

The existing hybrid_search, get_page_full, and list_spaces MCP tools remain unchanged. graph_search is an additional fourth tool the agent can call for questions that require following relationships across pages.
Cost addition (medium scale, 100k pages):
Component | Monthly cost |
Amazon Neptune (db.r6g.large) | ~$200 |
Neptune Analytics (2 NCUs) | ~$180 |
Claude Haiku entity extraction (initial, one-time) | ~$50 |
Claude Haiku entity extraction (monthly delta, 5% change) | ~$3 |
Total recurring addition (excluding the one-time ~$50 initial extraction) | ~$383/month |
Comparison across all approaches
Approach | Retrieval quality | Ops complexity | Monthly cost delta | Best for |
Prototype (BM25+dense+RRF+rerank) | Good | Low | baseline | Getting started |
Vector DB upgrade (Qdrant/OpenSearch kNN) | Good | Medium | +$0–200 | Scale, not quality |
Contextual Retrieval + BM25 | Very good | Low | −$50–200 (saves embedding cost) | Best quality-to-effort ratio |
GraphRAG | Excellent on multi-hop | High | +$350–600 | Complex investigative questions |
Contextual + GraphRAG | Excellent across all types | Very high | +$250–400 net | Full enterprise production |
Recommended implementation order
Phase 1 — Contextual Retrieval (week 1–2) Add a context-generation step before BM25 indexing. Remove dense embeddings and the vector index. This is the highest-ROI change: better retrieval quality, lower running cost, less infrastructure. The MCP server, agent, and chatbot are untouched.
Phase 2 — Vector DB for scale (month 2–3, if corpus > 500k chunks) If the corpus grows large enough that OpenSearch BM25 performance degrades, add a vector index (OpenSearch kNN or Qdrant) back in alongside the contextual BM25. At this scale the quality gain from contextual retrieval still applies on top.
Phase 3 — GraphRAG (month 4–6, if multi-hop questions dominate)
Instrument user queries for 4–6 weeks after Phase 1. If a significant share of questions require tracing relationships across pages (history of a decision, ownership chains, impact of an incident), add the Neptune graph layer and the graph_search MCP tool. This is a non-trivial engineering investment, so only do it when the query analysis confirms it is the right bottleneck to fix.
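One rough way to run that query analysis is to tag logged questions with a keyword heuristic and track the share that looks multi-hop. The cue list and the `multi_hop_share` function below are assumptions made for illustration; a real analysis would likely need manual review or a classifier on top.

```python
import re

# Hypothetical cue phrases suggesting a question spans relationships across pages.
MULTI_HOP_CUES = re.compile(
    r"\b(why|what changed|who owns|history of|impact of|led to|caused)\b",
    re.IGNORECASE,
)


def multi_hop_share(queries: list[str]) -> float:
    """Fraction of logged queries that match at least one multi-hop cue."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if MULTI_HOP_CUES.search(q))
    return hits / len(queries)


logged = [
    "How do I deploy service X?",
    "What changed in our deployment process last quarter and why?",
    "What is the on-call policy?",
    "Who owns the billing pipeline?",
]
print(f"{multi_hop_share(logged):.0%} of queries look multi-hop")
```

What counts as a "significant share" is a judgment call the text leaves open; the sketch only produces the number to reason about.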
Key principle: The vector database is a storage and scale concern. Contextual Retrieval and GraphRAG are quality concerns. Fix quality first, then scale the storage layer to match the corpus size.