Skip to main content
Glama
piter5285

Confluence Hybrid RAG MCP Server

by piter5285

Confluence Hybrid RAG + Agentic RAG + MCP + PROD Design

This project implements hybrid agentic RAG over Confluence using Python, bm25s, OpenAI embeddings, RRF, Cohere reranker, Pydantic-AI, and FastMCP. An MCP server and Streamlit chatbot are implemented as core functionalities, enabling Confluence search from Claude Desktop, Cursor, or Claude Code.


Table of Contents


Related MCP server: Confluence Communication Server

Quick reference — choose your retrieval approach

Pick your approach based on corpus size and the type of questions your users ask. Details, cost estimates, and AWS implementation for each option are in the Production architecture section below.

By corpus size

Confluence pages

Recommended approach

Storage needed

< 50 000

Contextual BM25 + reranker

OpenSearch only

50 000 – 300 000

Contextual BM25 + reranker

OpenSearch only

300 000 – 1 000 000

Contextual hybrid (BM25 + dense) + reranker

OpenSearch + vector index

> 1 000 000

Contextual hybrid + reranker + GraphRAG

OpenSearch + vector index + Neptune

By question type

Questions your users ask

Best approach

"How do I do X?" — specific how-to

Contextual BM25 + reranker

"What is X?" — definitions, policies

Contextual BM25 + reranker

"Find anything about X" — paraphrased, exploratory

Contextual hybrid (BM25 + dense)

"What changed in X and why?" — history, causality

GraphRAG

"Who owns X and what is its current status?" — ownership chains

GraphRAG

All of the above

Contextual hybrid + GraphRAG

All options at a glance

Approach

Quality

Ops complexity

AWS cost / month (medium scale)

Removes vector DB?

Prototype — BM25 + dense + RRF + rerank

Good

Low

~$1 335

No

Vector DB upgrade — Qdrant / OpenSearch kNN

Good

Medium

~$1 400

No

Contextual BM25 — Claude context prefix + BM25 + rerank

Very good

Low

~$1 000

Yes

Contextual hybrid — context prefix + BM25 + dense + rerank

Excellent

Medium

~$1 200

No

GraphRAG — knowledge graph traversal

Excellent on multi-hop

High

~$1 720

Optional

Contextual hybrid + GraphRAG

Best across all query types

Very high

~$1 900

No

  1. Start with Contextual BM25 — add a Claude context-generation step before indexing, remove dense embeddings. Better quality, lower cost, less infrastructure. No changes to the MCP server, agent, or chatbot.

  2. Add dense embeddings back when you observe users rephrasing the same question multiple ways and getting inconsistent results — the signal that BM25 recall is the bottleneck.

  3. Add GraphRAG only after analysing real user queries for 4–6 weeks and confirming that multi-hop questions (history of a decision, ownership chains, impact of an incident) represent a significant share of traffic.

Key principle: the vector database is a scale concern, not a quality concern. Contextual Retrieval and GraphRAG address the actual quality bottlenecks. Fix quality first, then scale storage to match corpus size.


Why combine the two patterns?

What hybrid retrieval solves

Your company wiki uses different words than your users do. A user asks: "How do we handle auth failures?" — the Confluence page is titled "Authentication error handling". BM25 alone misses it (no word overlap). Dense search alone misses it (rare product names get diluted). The BM25 + dense + RRF + rerank pipeline from knowledge/hybrid-retrieval catches both.

What agentic retrieval solves

Some questions need more than one search: "What changed in the deployment process last quarter and why?" requires reading the current process page, finding the ADR that changed it, and cross-referencing an incident report. A single-shot retrieval can't do that. The agentic loop from knowledge/agentic-rag lets the model search multiple times, follow leads, and synthesise across pages.

What MCP adds

MCP makes the whole thing a first-class tool inside any MCP client. Claude in your IDE or chat interface can search Confluence on its own when it needs internal context — without you having to copy-paste docs into the prompt manually.

Architecture

Confluence REST API
       │
       ▼
1-fetch-confluence.py   pull pages → strip HTML → chunk → chunks/*.json
       │
       ▼
2-build-index.py        BM25 index (bm25s)  +  dense embeddings (OpenAI)
                                │                        │
                                └────────────┬───────────┘
                                             ▼
                               indexes/bm25/  indexes/embeddings.npy
                               indexes/meta.json
                                             │
                        ┌────────────────────┼───────────────────────┐
                        ▼                    ▼                       ▼
               3-hybrid-search.py      4-agent.py           5-mcp-server.py
               (interactive CLI)    (pydantic-ai agent)   (FastMCP → Claude)

Retrieval pipeline (inside every search call)

Query
  │
  ├─► BM25           catches exact terms, product names, ticket IDs
  │                  e.g. "JIRA-4521", "prod-db-01", "rerank-v4.0-fast"
  │
  ├─► Dense          catches paraphrase and synonyms
  │   (cosine)       e.g. "auth failure" ↔ "authentication error"
  │
  ├─► RRF            fuses the two ranked lists without score normalisation
  │                  (RRF sidesteps the problem that BM25 and cosine scores
  │                   live on completely different scales)
  │
  └─► Cohere rerank  cross-encoder that sees query + document jointly
      top-50 in      much higher precision than bi-encoder dense alone
      → top-10 out

Agentic loop (inside 4-agent.py and every MCP session)

User question
      │
      ▼
 list_spaces          (which Confluence domains are indexed?)
      │
      ▼
 hybrid_search ──────► BM25 + dense + RRF + rerank
      │
      ▼
 snippets relevant?
      ├─ Yes ──► synthesise answer
      └─ No  ──► get_page_full (read a full page)
                 or hybrid_search again with a refined query
                      │
                      ▼
              synthesise answer + citations

MCP integration

The MCP server in 5-mcp-server.py exposes three tools:

Tool

What it does

list_confluence_spaces

Returns indexed spaces (key + name)

search_confluence

Four-stage hybrid search; returns snippets

get_confluence_page

Returns full text of one page by page_id

Claude acts as the agent — it calls these tools in a loop the same way 4-agent.py does internally. No duplicate agent layer on the server.

Connect Claude Desktop

{
  "mcpServers": {
    "confluence": {
      "url": "http://localhost:8051/sse"
    }
  }
}

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json Windows: %APPDATA%\Claude\claude_desktop_config.json

Connect Claude Code

Add .mcp.json in your project root:

{
  "mcpServers": {
    "confluence": {
      "type": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Connect Cursor

Settings → MCP → Add server → URL: http://localhost:8051/sse

Setup

# 1. Install dependencies (from the repo root)
uv sync

# 2. Copy and fill in credentials
cp .env.example .env

# 3. Fetch Confluence pages (run with empty CONFLUENCE_SPACE_KEYS first
#    to see available spaces, then set the ones you want and re-run)
uv run 1-fetch-confluence.py

# 4. Build BM25 + dense indexes (~$0.002 per 1 000 chunks)
uv run 2-build-index.py

# 5. Test retrieval interactively
uv run 3-hybrid-search.py

# 6. Try the full agent (optional)
uv run 4-agent.py "What is our deployment process?"

# 7. Start the MCP server for Claude Desktop / Cursor / Claude Code
uv run 5-mcp-server.py

You need four API keys in .env:

  • CONFLUENCE_BASE_URL + CONFLUENCE_EMAIL + CONFLUENCE_API_TOKEN

  • OPENAI_API_KEY — embeddings (text-embedding-3-small)

  • COHERE_API_KEY — reranker (rerank-v4.0-fast, free tier)

  • ANTHROPIC_API_KEY — only needed for 4-agent.py

Testing without a Confluence instance

1-fetch-k8s.py is a drop-in replacement for 1-fetch-confluence.py that scrapes the public Kubernetes documentation (kubernetes.io/docs) instead of your Confluence instance. It produces chunks in the exact same JSON format, so every subsequent step — 2-build-index.py, 3-hybrid-search.py, 5-mcp-server.py, 6-chatbot.py — runs unchanged. No API keys are needed.

Prerequisites

# beautifulsoup4 is the only extra dependency
uv pip install beautifulsoup4

Running the scraper

# Scrape ~190 Kubernetes docs pages (≈ 2 min at 0.5 s/request)
uv run 1-fetch-k8s.py

# Then continue with the normal pipeline
uv run 2-build-index.py
uv run 5-mcp-server.py              # terminal 1
uv run streamlit run 6-chatbot.py   # terminal 2

Expected output:

Fetching sitemap: https://kubernetes.io/en/sitemap.xml
Found 192 pages to index

[  1/192] Concepts                                           3 chunk(s)
[  2/192] Kubernetes Components                              4 chunk(s)
...
Done: 189 pages → 847 chunks  (2 empty, 1 errors)
Chunks saved to: .../chunks/

Configuration

All tunable constants are at the top of 1-fetch-k8s.py:

Constant

Default

What it controls

INCLUDE_SECTIONS

concepts/, tasks/, tutorials/, setup/, reference/glossary/, reference/kubectl/

Sections of kubernetes.io/docs to crawl

SKIP_PATTERNS

reference/kubernetes-api/, contribute/

Sub-paths excluded even if they fall under an included section

REQUEST_DELAY

0.5 s

Pause between HTTP requests (respectful crawl rate)

MAX_CHUNK_CHARS

1500

Maximum characters per chunk

OVERLAP_CHARS

150

Overlap between consecutive chunks

The large API reference (reference/kubernetes-api/, ~600 pages of spec tables) is excluded by default to keep index size manageable. Add it to INCLUDE_SECTIONS if you need API field lookups.

Output format

Each chunk is saved as a JSON file in chunks/ and uses space_key: "K8S". The schema is identical to Confluence chunks, so 3-hybrid-search.py and the MCP server treat them the same way:

{
  "chunk_id": "docs-concepts-workloads-pods_c0",
  "page_id": "docs-concepts-workloads-pods",
  "title": "Pods",
  "space_key": "K8S",
  "space_name": "Kubernetes Docs",
  "url": "https://kubernetes.io/docs/concepts/workloads/pods/",
  "text": "...",
  "chunk_idx": 0,
  "last_modified": "Mon, 10 Jun 2024 12:00:00 GMT"
}

Example queries once running

  • "What is the difference between a Deployment and a StatefulSet?"

  • "How do I configure resource limits for a Pod?"

  • "What happens when a node fails?"

  • "How does the Kubernetes scheduler decide where to place a Pod?"

Files

./
├── 1-fetch-confluence.py   Production: fetch & chunk Confluence pages
├── 1-fetch-k8s.py          Test/demo: scrape public Kubernetes docs
├── 2-build-index.py        Build BM25 + dense indexes
├── 3-hybrid-search.py      Interactive search CLI
├── 4-agent.py              Pydantic-AI agent (standalone)
├── 5-mcp-server.py         FastMCP server for Claude/Cursor/Claude Code
├── 6-chatbot.py            Streamlit chatbot (MCP client + chat UI)
├── pyproject.toml          All dependencies
├── .env.example
└── utils/
    ├── confluence.py        ConfluenceClient + html_to_text + chunk()
    ├── retrieval.py         HybridRetriever (BM25+dense+RRF+rerank)
    └── agent_tools.py       list_spaces / hybrid_search / get_page_full
                             shared by 4-agent.py and 5-mcp-server.py

Keeping the index fresh

Confluence changes. A simple refresh loop:

# Re-fetch changed pages and rebuild indexes (run nightly via cron)
uv run 1-fetch-confluence.py && uv run 2-build-index.py

For incremental updates, add last_modified filtering in iter_pages() to skip pages not changed since the last run (compare against the timestamp stored in indexes/meta.json).

Extending

Want to add

Where to change

Filter by Confluence label/ancestor

ConfluenceClient.iter_pages() — add expand=ancestors,metadata.labels

Parent/child page navigation tool

Add get_child_pages(page_id) to utils/agent_tools.py

Local reranker (offline)

Swap Cohere in HybridRetriever for BAAI/bge-reranker-v2-m3

Vector DB instead of numpy

Replace np.load in HybridRetriever with Qdrant/LanceDB client

Incremental index updates

Track last_modified in meta.json, skip unchanged pages in step 1

Evaluate retrieval quality

Use knowledge/hybrid-retrieval/docs/build-your-own-eval.md as recipe


Production architecture (many GB of Confluence data)

The prototype works well for hundreds of pages but hits hard limits at scale. This section explains what breaks and how to redesign each layer for production.

Why the prototype does not scale

Component

Prototype

Breaks when…

Dense index

embeddings.npy loaded fully into RAM

>500 k chunks (~3 GB at 1536 dims, fp32)

BM25 index

bm25s rebuilt from scratch every sync

Corpus grows beyond a few hundred MB of text

Sync

Full re-fetch + full re-index

Pages number in the tens of thousands

MCP server

Single-process, no auth

Multiple concurrent users, internal tool exposure

Access control

None — every user sees every page

Any team with Confluence page restrictions

Chunking

Character-based with fixed overlap

Long structured pages (tables, code blocks split badly)


Vector database

The right choice depends on where you run infrastructure. The short version: Qdrant if you want the best hybrid search engine; Amazon OpenSearch Service if you are already on AWS and want to stay AWS-native; Bedrock Knowledge Bases if you want zero ingestion pipeline work.

Option

Cloud

Choose if…

Trade-off

Qdrant Cloud

Any (hosted on AWS/GCP/Azure infra)

Starting fresh, want managed ops

Easiest setup; data in Qdrant's account

Qdrant on EKS/ECS

AWS

Data must stay in your AWS account; already run containers

You manage upgrades and backups

Qdrant self-hosted

On-prem / any VM

Full control, air-gapped environments

Full operational burden

Amazon OpenSearch Service

AWS

Already on AWS, want IAM + CloudWatch native integration

Slightly slower ANN than Qdrant at extreme scale

Amazon Bedrock Knowledge Bases

AWS

Want zero ingestion pipeline; Confluence connector built-in

Less control over chunking, reranking, and MCP integration

Azure AI Search

Azure

Already on Azure

Native Confluence connector; higher cost per query

pgvector (RDS/Aurora)

Any

Small corpus (<1 M chunks), already run PostgreSQL

Simplest ops; ANN slows above ~5 M vectors

Pinecone

Any

No native sparse support in standard tier; vendor lock-in

Qdrant is the strongest general-purpose choice because it is the only option in this list with native hybrid search — dense + sparse vectors in a single query with built-in RRF — without requiring application-level result merging. For AWS-specific guidance see the Running on AWS section below.

Sparse vectors — BM42 instead of BM25

In production, replace the bm25s BM25 index with BM42 sparse vectors stored inside Qdrant. BM42 is a neural sparse model (built into Qdrant's FastEmbed library) that produces sparse vectors compatible with Qdrant's sparse index. It significantly outperforms classical BM25 on paraphrased queries while preserving the exact-term matching that dense embeddings miss. The result is a single Qdrant collection with both a dense vector field and a sparse vector field, queried together in one round-trip.

Embedding model

Keep text-embedding-3-small as the default (best cost/quality ratio for this use case). Switch to text-embedding-3-large only if your Confluence content is heavily multilingual or filled with specialised technical jargon where the extra embedding capacity measurably improves NDCG on your eval set.

Reranker

Keep Cohere rerank-v4.0-fast for managed convenience. Switch to BAAI/bge-reranker-v2-m3 (self-hosted, ~568 MB) if your data governance policy prohibits sending document snippets to an external API.

Chunking

Replace the character-based chunking in utils/confluence.py with semantic chunking using Docling. Docling understands Confluence's HTML structure — it splits at heading boundaries, keeps table rows together, and does not cut mid-sentence. Pair this with a parent-child chunking strategy: embed small chunks (~256 tokens) for high-precision retrieval, but return the parent section (~1 024 tokens) as context to the agent. This gives precise matching without truncating the evidence the model needs to answer well.


Running on AWS

If your company operates on AWS, each layer of the stack maps to a managed AWS service. You have three realistic paths depending on how much control you want over the retrieval pipeline.

Path 1 — Qdrant on AWS (best retrieval quality, your infra)

Run Qdrant inside your own AWS account so data never leaves your perimeter, while keeping Qdrant's native hybrid search.

Qdrant deployment

When to choose

Qdrant Cloud on AWS

Fastest start; Qdrant manages ops; pick the AWS region closest to your app. Data lives in Qdrant's AWS account — check your data residency policy first.

Qdrant on EKS

Already run Kubernetes; use the official Qdrant Helm chart; EBS volumes for persistent storage; IAM for pod-level auth.

Qdrant on ECS Fargate

No Kubernetes; run the official Docker image as a Fargate service; EFS mount for persistence; simpler ops than EKS.

This is the pragmatic default for AWS. OpenSearch Service is fully managed, stays inside your AWS account, and has supported hybrid search (BM25 + k-NN vector in a single query) since version 2.10. It replaces Qdrant without any change to the MCP server or agent — only utils/retrieval.py changes.

Why to choose OpenSearch Service over Qdrant on AWS:

  • IAM authentication — no separate credentials; attach an IAM role to your workers and MCP server, and OpenSearch accepts them natively.

  • VPC isolation — deploy into a private VPC subnet; no public endpoint needed.

  • CloudWatch integration — cluster metrics, slow query logs, and index statistics flow to CloudWatch out of the box.

  • One fewer vendor — no Qdrant Cloud account, no Qdrant billing, no separate support contract. Everything is under your existing AWS bill.

  • Familiar to AWS ops teams — most AWS platform teams already know how to run OpenSearch.

The trade-off: OpenSearch is slower per node than Qdrant at very high query volume (Qdrant is Rust, OpenSearch is JVM). For a company-internal Confluence search workload you will not reach that ceiling.

Path 3 — Amazon Bedrock Knowledge Bases (fully managed, zero pipeline)

If engineering time is the bottleneck, Bedrock Knowledge Bases can eliminate most of the ingestion pipeline. It has a native Confluence data source connector — you supply OAuth credentials and a space list, and Bedrock handles fetching, chunking, embedding (Amazon Titan or third-party models), and indexing into either OpenSearch Serverless or Aurora pgvector.

What you keep: the MCP server and the Streamlit chatbot. What you replace: 1-fetch-confluence.py, 2-build-index.py, utils/retrieval.py, and utils/confluence.py. The search_confluence tool in the MCP server calls the Bedrock Retrieve API instead of Qdrant directly.

Trade-offs:

  • Less control over chunk size, overlap, and chunking strategy.

  • Reranking must go through Bedrock's own reranker; Cohere is not directly pluggable.

  • Bedrock Agents (not your MCP server) handles the agentic loop if you use RetrieveAndGenerate. If you want to keep the MCP pattern, call only the Retrieve API and drive the loop from your FastMCP server as today.

  • Cold-start latency on OpenSearch Serverless can be high for infrequent queries.

AWS services mapping

Role

AWS service

Confluence change events

Confluence webhooks → API GatewaySQS

Ingestion workers

ECS Fargate (auto-scaling) or EKS

Vector store (AWS-native)

Amazon OpenSearch Service (hybrid BM25 + k-NN)

Vector store (Qdrant path)

Qdrant on EKS or ECS Fargate

Fully managed RAG

Amazon Bedrock Knowledge Bases + Confluence connector

Query result cache

ElastiCache for Redis (TTL 15 min)

MCP server hosting

ECS Fargate behind an ALB

TLS termination

ALB (ACM certificate, no Nginx needed)

DDoS + WAF

AWS WAF on the ALB

API key / secret storage

AWS Secrets Manager (rotate without redeploy)

Container images

ECR (Elastic Container Registry)

Monitoring + alerts

CloudWatch metrics, alarms, and dashboards

Embedding API

OpenAI (via internet) or Amazon Bedrock hosted models

Reranker API

Cohere (via internet) or Bedrock reranker

AWS architecture diagram

Confluence Cloud
      │
      │  Webhooks (page_created / updated / deleted)
      ▼
┌──────────────┐    ┌─────────────────────────────────────────┐
│ API Gateway  │───►│  Amazon SQS                              │
│ (webhook     │    │  — buffers change events, decouples      │
│  endpoint)   │    │    Confluence rate limits from workers   │
└──────────────┘    └──────────────────┬──────────────────────┘
                                       │
                                       ▼
                        ┌─────────────────────────────┐
                        │  ECS Fargate — Ingestion     │
                        │  Workers (auto-scaling)      │
                        │                              │
                        │  1. Fetch page (Confluence   │
                        │     REST API)                │
                        │  2. Chunk (Docling)          │
                        │  3. Embed (OpenAI / Bedrock) │
                        │  4. Upsert to vector store   │
                        └──────────────┬──────────────┘
                                       │
                    ┌──────────────────┼────────────────────┐
                    │                  │                     │
                    ▼                  ▼                     ▼
         ┌──────────────────┐  ┌─────────────┐  ┌────────────────────┐
         │ Amazon OpenSearch │  │  Qdrant on  │  │ Bedrock Knowledge  │
         │ Service          │  │  EKS / ECS  │  │ Bases              │
         │ (BM25 + k-NN     │  │  (dense +   │  │ (managed, native   │
         │  hybrid, IAM)    │  │   sparse)   │  │  Confluence sync)  │
         └────────┬─────────┘  └──────┬──────┘  └─────────┬──────────┘
                  └──────────────────┬┘                    │
                                     │                     │
                                     ▼                     │
                        ┌────────────────────┐             │
                        │  ECS Fargate        │             │
                        │  MCP Server         │◄────────────┘
                        │  (FastMCP)          │
                        │  + Secrets Manager  │  ┌──────────────────┐
                        │    for API keys     │◄►│ ElastiCache Redis │
                        └────────┬────────────┘  │ query cache       │
                                 │               │ TTL 15 min        │
                                 │               └──────────────────┘
                                 ▼
                    ┌────────────────────────┐
                    │  ALB (HTTPS, ACM cert) │
                    │  + AWS WAF             │
                    └────────────┬───────────┘
                                 │  HTTPS MCP (SSE)
                                 ▼
              Claude Desktop / Cursor / Claude Code
              Streamlit chatbot (ECS Fargate or EC2)

AWS decision guide

Your situation

Recommended path

Starting from scratch on AWS, want best retrieval quality

Qdrant Cloud on AWS (fastest) → migrate to EKS when you need data residency

Data must stay in your AWS account, run Kubernetes

Qdrant on EKS

Data must stay in your AWS account, no Kubernetes

Amazon OpenSearch Service

AWS platform team, want IAM + CloudWatch native

Amazon OpenSearch Service

Engineering time is the bottleneck, want it working this week

Amazon Bedrock Knowledge Bases

Already run pgvector / RDS Aurora, corpus < 1 M chunks

pgvector — no new service needed


Production architecture diagram (generic)

Confluence Cloud
      │
      │  Webhooks (page_created / page_updated / page_deleted)
      │  or scheduled polling every 15 min via REST API (lastModified filter)
      ▼
┌─────────────────────┐
│   Message Queue      │   Redis Streams / AWS SQS / RabbitMQ
│   change events      │   — decouples ingestion speed from Confluence rate limits
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Ingestion Worker   │   One or more worker processes (Celery / ARQ / plain threads)
│                      │   1. Fetch changed page from Confluence REST API
│   fetch → chunk      │   2. Semantic chunk with Docling
│   embed → upsert     │   3. Embed with text-embedding-3-small (batch)
│                      │   4. Upsert dense + sparse vectors into Qdrant
└──────────┬──────────┘   5. Delete Qdrant points for removed pages
           │
           ▼
┌─────────────────────────────────────────────────────┐
│                     Qdrant Collection                │
│                                                      │
│  point_id: chunk_id                                  │
│  dense vector:  text-embedding-3-small (1 536 dims)  │
│  sparse vector: BM42                                 │
│  payload:  { page_id, title, space_key, url,         │
│              last_modified, labels[], ancestor_ids[] }│
│                                                      │
│  Payload indexes on: space_key, labels, last_modified│
└──────────┬──────────────────────────────────────────┘
           │
           │  Hybrid query (dense + sparse + RRF) with payload filter
           │  → top-50 candidates
           │  → Cohere rerank → top-10
           ▼
┌─────────────────────┐        ┌───────────────────────┐
│   MCP Server         │        │   Redis Cache          │
│   (FastMCP)          │◄──────►│   query → results      │
│                      │        │   TTL: 15 min          │
│   + API key auth     │        └───────────────────────┘
│   + rate limiting    │
│   + request logging  │
└──────────┬──────────┘
           │  MCP protocol (SSE or Streamable HTTP)
           ▼
  Claude Desktop / Cursor / Claude Code / Streamlit chatbot

Access control

This is the most important production concern that the prototype ignores entirely. Confluence has page-level and space-level permissions. Without access control, any user of the chatbot can retrieve content from restricted pages they would not normally be allowed to read.

There are three approaches, ordered by accuracy vs. implementation cost:

Option A — Space-level filtering (simplest, coarse-grained) Index only pages from spaces the chatbot is authorised to see. At query time, filter Qdrant results to the spaces the requesting user's Confluence group can access. Works well when spaces map cleanly to team boundaries. Leaks nothing as long as restricted content is in its own space.

Option B — Permission index (recommended for most companies) When ingesting each page, call the Confluence REST API (GET /wiki/rest/api/content/{id}/restriction) to fetch which users and groups can read it. Store those groups in the Qdrant payload alongside each chunk. At query time, resolve the requesting user's group membership (via Confluence or your IdP) and add a Qdrant payload filter groups IN [user_groups]. The filter runs inside Qdrant before scoring — no restricted results are ever returned. Rebuild this permission payload whenever Confluence restriction changes arrive via webhook.

Option C — Per-request Confluence API check (most accurate, slowest) After Qdrant returns candidates, call the Confluence REST API to verify the requesting user can read each candidate page, and filter out the ones they cannot. Accurate because it uses Confluence's own permission system as the source of truth, but adds a round-trip per result set and becomes a bottleneck at high query volume.

For most enterprise deployments, Option B is the right balance.


Sync strategy

Scenario

Approach

Initial load

Bulk-fetch all spaces in parallel workers; embed in batches of 100; upsert to Qdrant

Ongoing updates

Confluence webhooks → queue → worker upserts only changed chunks

No webhook access

Scheduled polling every 15 min using lastModified query parameter; compare against stored last_modified in Qdrant payload; skip unchanged pages

Page deleted

Webhook page_deleted event → delete all Qdrant points where page_id == deleted_id

Space deleted

Delete all Qdrant points where space_key == deleted_key


AWS cost estimates

Prices below are approximate, based on us-east-1 on-demand rates as of mid-2026. Verify current prices with the AWS Pricing Calculator before budgeting. All figures are per month.

Tier assumptions

Small

Medium

Large

Confluence pages indexed

10 000

100 000

500 000

Chunks in index

~50 000

~500 000

~2 500 000

Active users

50

500

2 000

Queries per day

100

1 000

5 000

Queries per month

3 000

30 000

150 000

Monthly cost breakdown — OpenSearch Service path

Component

Small

Medium

Large

Amazon OpenSearch Service

Instance (t3.small × 2 nodes HA)

$52

Instance (m6g.large × 3 nodes HA)

$320

Instance (m6g.2xlarge × 3 nodes HA)

$1 280

EBS storage (gp3)

$3

$27

$135

ECS Fargate — MCP server

0.5 vCPU / 1 GB (1 instance)

$18

0.5 vCPU / 1 GB (2 instances)

$36

0.5 vCPU / 1 GB (4 instances)

$72

ECS Fargate — ingestion workers

$5

$15

$40

ElastiCache Redis

cache.t3.micro

$12

cache.t3.small × 2 (multi-AZ)

$49

cache.r6g.large × 2 (multi-AZ)

$240

ALB + ACM certificate

$20

$25

$35

AWS WAF

$10

$20

SQS + Secrets Manager + CloudWatch

$8

$12

$20

OpenAI text-embedding-3-small

<$1

$1

$5

Cohere rerank-v4.0-fast

$6

$60

$300

Anthropic Claude claude-sonnet-4-6

$78

$780

$3 900

Estimated total

~$203

~$1 335

~$6 047

Anthropic cost breakdown per query

Each user question triggers an agentic loop with 2–3 Claude API calls (list_spaces → search → optionally get_page). A realistic average:

Token type

Tokens per query

Cost (Sonnet 4.6)

Input (system prompt + tool results)

~5 500

$0.0165

Output (tool selections + final answer)

~600

$0.0090

Total per query

~$0.026

At 1 000 queries/day × 30 days = ~$780/month in Claude API costs alone. This is the dominant cost at every scale — infrastructure is secondary.

Key observations

  1. The LLM API bill dominates. At medium scale Claude accounts for ~59% of total spend; at large scale ~64%. Optimise here first before touching infrastructure.

  2. Infrastructure costs are reasonable. Even at large scale (2 500 OpenSearch nodes, 4 ECS services, Redis cluster) the AWS bill excluding API costs is ~$1 800/month — roughly one mid-level engineer's monthly salary. The ROI calculation is almost always positive.

  3. OpenSearch storage is cheap. 2.5 M chunks × 300 tokens × ~4 bytes ≈ 3 GB of raw text. With OpenSearch overhead (inverted index + k-NN graph) plan for ~10× = ~30 GB = ~$4/month. Storage is never the problem.

Cost optimisation levers

These are ordered by impact:

1. Anthropic prompt caching — saves 20–30% on Claude costs Enable cache_control on the system prompt and the tool definitions block. Cached input tokens are billed at $0.30/MTok (90% discount vs $3/MTok). The system prompt (~500 tokens) and tool list (~300 tokens) are identical on every call and qualify for caching immediately.

2. Route simple queries to Claude Haiku — saves up to 60% on Claude costs Haiku 4.5 costs $0.80/MTok in and $4/MTok out — roughly 4× cheaper than Sonnet. Add a classifier that sends single-fact lookups ("What is a ConfigMap?") to Haiku and only escalates multi-hop questions ("What changed in our deployment process and why?") to Sonnet. If 60% of queries qualify, medium-scale Claude spend drops from ~$780 to ~$390/month.

3. Redis query caching — saves 20–40% on Claude costs for repeated questions Teams ask the same questions. "How do I request VPN access?" is asked by every new joiner. A Redis cache keyed on a normalised query hash with a 15-minute TTL (already in the architecture diagram) eliminates the Claude round-trip entirely for cache hits. Common internal Q&A workloads see 25–35% cache hit rates.

4. Reserved instances for OpenSearch — saves 30–40% on compute A 1-year Reserved Instance for m6g.large.search drops from $0.148/hr to ~$0.088/hr. On three nodes that saves ~$215/month (from $320 to $190 per 3-node cluster). Commit only after validating instance size in production.

5. Fargate Spot for ingestion workers — saves ~70% on worker compute Ingestion workers are interruptible — if a Spot interruption occurs, SQS re-delivers the message and the worker retries. Switch the ECS task definition to use FARGATE_SPOT capacity provider. At medium scale this saves ~$10/month; at large scale ~$28/month.

6. Cohere free tier for small teams The Cohere trial tier gives 1 000 free rerank calls/month. A team of 50 users making 3 searches/day averages ~4 500 calls/month — just over the free limit. Reduce top_k from 50 to 25 candidates sent to the reranker to halve call volume at a small quality cost.

Realistic optimised costs

Applying caching, Haiku routing (60% of queries), and reserved instances:

Small

Medium

Large

Baseline estimate

~$203

~$1 335

~$6 047

After optimisations

~$110

~$620

~$2 900

Saving

~46%

~54%

~52%

Amazon Bedrock Knowledge Bases — cost comparison

The fully managed path trades control for simplicity but is not always cheaper:

Component

Monthly cost

Bedrock Titan Embeddings ($0.0001/1K tokens)

~$0.15 (initial), <$1 ongoing

OpenSearch Serverless (minimum 4 OCUs)

~$700

Bedrock Retrieve API calls

~$0.10 per 1 000 calls

Anthropic Claude (same as above)

same

The OpenSearch Serverless minimum of 4 OCUs (~$700/month) makes Bedrock Knowledge Bases more expensive than self-managed OpenSearch at small and medium scale. It becomes cost-competitive only above ~2 M chunks where you need multiple OpenSearch data nodes anyway. The main argument for Bedrock Knowledge Bases is not cost — it is engineering time saved on the ingestion pipeline.

Disclaimer: All figures are estimates based on public AWS and API pricing as of mid-2026. Actual costs depend on your specific usage patterns, AWS region, negotiated enterprise pricing, and data transfer costs. Use the AWS Pricing Calculator for precise projections before committing to an architecture.


Beyond vector databases — better retrieval approaches

Swapping one vector database for another improves scale and operational robustness but does almost nothing for retrieval quality. The two approaches below address the actual quality bottlenecks for a Confluence-sized corpus.


Approach 1 — Contextual Retrieval

What it is: Before indexing each chunk, ask Claude to prepend a short context paragraph describing where the chunk sits in the document, what the page is about, and why this section matters. Then index with BM25 only — no dense embeddings at all — and apply the existing Cohere reranker on top.

Anthropic published benchmarks in late 2024 showing this approach achieves 49% fewer retrieval failures compared to naive BM25 + dense hybrid. BM25 with prepended context outperforms BM25 + dense without it.

Why Confluence chunks need this: When you strip HTML from a Confluence page and split it into 1 500-character chunks, the chunks lose their surrounding context. A chunk that says "set the flag to true to enable this feature" scores well for the query "how do I enable features" but is useless without knowing which page it came from and which feature it refers to. The prepended context sentence — "This chunk is from the Engineering Handbook, section on Feature Flags, describing how to enable a new flag in the production config service" — makes the chunk self-contained and dramatically more retrievable.

What changes in the pipeline:

Step

Without contextual retrieval

With contextual retrieval

Indexing

chunk → BM25 + embed → store

chunk → Claude context → contextual chunk → BM25 → store

Dense embeddings

Required

Removed entirely

Vector database

Required

Not needed

BM25 index

Required

Required (same)

Reranker

Required

Required (same)

MCP server

Unchanged

Unchanged

Agent

Unchanged

Unchanged

One-time indexing cost (Claude Haiku at $0.80/MTok):

Corpus size

Chunks

Context tokens

Indexing cost

10 000 pages

50 000

~10 M tokens

~$8

100 000 pages

500 000

~100 M tokens

~$80

500 000 pages

2 500 000

~500 M tokens

~$400

This is a one-time cost per full re-index, not a recurring monthly expense. Delta syncs (only changed pages) are proportionally cheaper.

Running cost impact: Removing dense embeddings eliminates the OpenAI embedding API call on every query (~$0.02/1M tokens, small but real), removes the vector index from OpenSearch or Qdrant (reducing storage by 30–50%), and simplifies the retrieval code to a single BM25 query path.

AWS implementation: Keep OpenSearch Service for BM25. Remove the k-NN plugin configuration and dense vector field entirely. Add a pre-indexing ECS task that calls the Anthropic API to generate context for each chunk before the ingestion worker writes to OpenSearch.


Approach 2 — GraphRAG

What it is: Instead of treating Confluence as a flat collection of text chunks, build a knowledge graph from it — extracting entities, relationships, and summaries — and answer questions by traversing the graph rather than scoring chunks by similarity.

Microsoft Research published GraphRAG in 2024 and showed 20–70% improvement over naive RAG on complex multi-hop questions depending on question type. The gains are largest on exactly the investigative questions that Confluence is used for.

Why Confluence is a natural graph: Confluence already has rich structure that flat retrieval throws away:

Space
  └── Section page
        ├── Child page A  ──links to──► ADR-042
        │     └── Child page A1         │
        └── Child page B                └──triggered by──► Incident-2024-Q3
              (author: team-platform)                        (owner: team-sre)

When a user asks "What changed in our deployment process last quarter and why?", the answer requires:

  1. Find the current deployment process page

  2. Follow links to the ADR that modified it

  3. Find the incident report that triggered the ADR

  4. Synthesise the chain of causality across three pages

A flat vector search returns the chunks with the highest similarity score. A graph traversal follows the actual structure of the knowledge.

How GraphRAG works for Confluence:

Ingestion
  │
  ├─► Extract entities from each page
  │     (system names, team names, process names, ticket IDs, dates)
  │
  ├─► Extract relationships between entities
  │     (page A links to page B, process X was changed by ADR Y,
  │      incident Z triggered decision W)
  │
  ├─► Build community summaries
  │     (cluster related pages into topics, summarise each cluster)
  │
  └─► Store in graph database (Neptune) + keep BM25 for keyword search

Query
  │
  ├─► Global questions ("what are our main deployment processes?")
  │     → community summary traversal, no chunk retrieval needed
  │
  └─► Local questions ("how do I deploy service X?")
        → entity lookup → graph hop → retrieve relevant pages → answer

Two query modes:

Mode

Best for

How it works

Local search

Specific factual questions

Find entity in graph → traverse 1–2 hops → retrieve source pages → answer

Global search

Broad thematic questions

Query community summaries → synthesise across the whole corpus

AWS implementation: Amazon Neptune (fully managed graph database) for the knowledge graph. Neptune Analytics (in-memory graph engine, announced 2023) for fast traversal queries. The ingestion worker gains a graph extraction step that calls Claude to identify entities and relationships per page, then writes edges to Neptune alongside the existing OpenSearch BM25 upsert.

AWS architecture addition for GraphRAG:

Ingestion worker
  │
  ├─► (existing) chunk → BM25 upsert → OpenSearch
  │
  └─► (new) page full text → Claude entity extraction
                              → Neptune upsert (nodes + edges)
                              → Neptune Analytics community clustering (nightly)

Query (MCP server)
  │
  ├─► hybrid_search (existing BM25 + rerank path)
  │
  └─► graph_search (new tool)
        → Neptune Analytics traversal
        → fetch source pages
        → synthesise answer

The existing hybrid_search, get_page_full, and list_spaces MCP tools remain unchanged. graph_search is an additional fourth tool the agent can call for questions that require following relationships across pages.

Cost addition (medium scale, 100k pages):

Component

Monthly cost

Amazon Neptune (db.r6g.large)

~$200

Neptune Analytics (2 NCUs)

~$180

Claude Haiku entity extraction (initial, one-time)

~$50

Claude Haiku entity extraction (monthly delta, 5% change)

~$3

Total addition

~$383/month


Comparison across all approaches

Approach

Retrieval quality

Ops complexity

Monthly cost delta

Best for

Prototype (BM25+dense+RRF+rerank)

Good

Low

baseline

Getting started

Vector DB upgrade (Qdrant/OpenSearch kNN)

Good

Medium

+$0–200

Scale, not quality

Contextual Retrieval + BM25

Very good

Low

−$50–200 (saves embedding cost)

Best quality-to-effort ratio

GraphRAG

Excellent on multi-hop

High

+$350–600

Complex investigative questions

Contextual + GraphRAG

Excellent across all types

Very high

+$250–400 net

Full enterprise production


Phase 1 — Contextual Retrieval (week 1–2) Add a context-generation step before BM25 indexing. Remove dense embeddings and the vector index. This is the highest-ROI change: better retrieval quality, lower running cost, less infrastructure. The MCP server, agent, and chatbot are untouched.

Phase 2 — Vector DB for scale (month 2–3, if corpus > 500k chunks) If the corpus grows large enough that OpenSearch BM25 performance degrades, add a vector index (OpenSearch kNN or Qdrant) back in alongside the contextual BM25. At this scale the quality gain from contextual retrieval still applies on top.

Phase 3 — GraphRAG (month 4–6, if multi-hop questions dominate) Instrument user queries for 4–6 weeks after Phase 1. If a significant share of questions require tracing relationships across pages (history of a decision, ownership chains, impact of an incident), add the Neptune graph layer and the graph_search MCP tool. This is a non-trivial engineering investment — only do it when the query analysis confirms it is the right bottleneck to fix.

Key principle: The vector database is a storage and scale concern. Contextual Retrieval and GraphRAG are quality concerns. Fix quality first, then scale the storage layer to match the corpus size.

A
license - permissive license
-
quality - not tested
C
maintenance

Maintenance

Maintainers
Response time
Release cycle
Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/piter5285/hybrid-agentic-RAG'

If you have feedback or need assistance with the MCP directory API, please join our Discord server