Doclea MCP

live-codex-ab-benchmark-next-steps-2026-02-16.md•15.2 KiB

# Live Codex A/B Benchmark - Next Steps (2026-02-16)

## Goal

Run a **real-world** benchmark with:

- `Codex without MCP` (`grep_tools` arm)
- `Codex + Doclea MCP` (`mcp_full` arm)

on target repo: `/home/pho7on/tin/monorepo/main`

## What Was Completed Today

- Added payload-vs-provider token accounting in `scripts/mcp-vs-grep-choice-benchmark.ts`.
  - New env: `DOCLEA_CHOICE_INPUT_TOKEN_SOURCE=payload|provider` (default `payload`).
  - Report now includes token accounting metadata.
- Added real-world Codex mode in `scripts/mcp-vs-grep-choice-benchmark.ts`.
  - New env: `DOCLEA_CHOICE_REALWORLD_CODEX=true`.
  - In this mode, scoring uses files from Codex JSON output (`cited_files`) instead of precomputed retrieval context.
  - Supports only `mcp_full` and `grep_tools` in real-world mode.
- Extended metadata to carry `projectPath` into CLI request.
  - Updated `scripts/lib/llm-cli-runner.ts` schema to allow `metadata.projectPath`.
- Reworked `scripts/llm-cli-codex-adapter.ts` to support isolated Codex config per run:
  - `DOCLEA_CODEX_USE_ISOLATED_CONFIG=true`
  - Enables mode-based MCP setup (`mcp_full` => doclea MCP attached, `grep_tools` => no doclea MCP).
  - Passes `--cd <projectPath>` to Codex so it runs in target repo.
- Fixed MCP stdio corruption issue by redirecting `console.log` to `stderr` in `src/index.ts`.
  - This prevents Doclea MCP stdout logs from breaking protocol frames.

## Current Status

- Formatting checks pass for modified files.
- Codex adapter was validated successfully for `mcp_full` mode with doclea MCP attached.
- Full benchmark rerun was **not completed yet**.
- One verification command for `grep_tools` arm was interrupted and should be rerun.

## Latest Authoritative Result (2026-02-16)

This is the latest completed measured run and should be treated as the current baseline.

- Run:
  - Model: `gpt-5-codex`
  - Reasoning: `medium`
  - Fixture: `/home/pho7on/tin/monorepo/main/.doclea/retrieval-agent-choice-queries.monorepo.json` (5 validated queries)
  - Choice report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.validsubset.full.gpt5-codex.medium.json`
  - Six-part report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.validsubset.full.gpt5-codex.medium.json`
  - Six-part HTML: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.validsubset.full.gpt5-codex.medium.dark.html`
- Result summary (`mcp_full` vs `grep_tools`):
  - Localization recall: `0.7736` vs `0.7918` (MCP behind by `1.82` points)
  - Localization precision: `0.6967` vs `0.7400` (MCP behind by `4.33` points)
  - Avg end-to-end per query: `2:09.5` vs `2:52.0` (MCP faster)
  - Modeled task time: `4:12.0` vs `4:41.3` (MCP faster)
  - Doc-drift triage (`mcp_hybrid_guardrail` vs `grep_tools`): MCP is strongly ahead (`+62.5` found-rate points, `-65.8%` tokens, `1.90x` faster)

## Focus To Beat Pure LLM (Priority)

Primary goal: make `mcp_full` win on localization quality while keeping the current speed advantage.

1. Fix query classes where MCP quality is trailing.
   - Main misses in latest run:
     - `q-outbox-transactional-dispatch` (precision gap)
     - `q-scoring-forwarding-pipeline` (recall + precision gap)
     - `q-worker-config-env-chain` (precision gap)
2. Add low-confidence fallback expansion in `mcp_full`.
   - If citation confidence is low or cross-scope coverage is weak, run targeted `rg` expansion before finalizing cited files.
3. Increase scope coverage guarantees in retrieval assembly.
   - Enforce at least one high-signal file per detected subsystem scope before final answer serialization.
4. Tighten citation post-filtering for precision.
   - Keep strict existence checks, but add relevance re-rank to drop weak extra files that hurt precision.
5. Tune `mcp_full` ranking weights on this fixture.
   - Run focused ablations on semantic vs lexical vs graph boosts using only the 5 validated queries and query-level deltas.
6. Gate on explicit win criteria before calling success.
   - Required at token budget `3500`:
     - `mcp_full` recall >= `grep_tools` + `5` points
     - `mcp_full` precision >= `grep_tools` + `5` points
     - `mcp_full` end-to-end <= `grep_tools` + `10%`

## Live Continuation Update (2026-02-16)

- Re-ran sanity verification for adapter mode gating:
  - `mcp_full`: `doclea_context` tool call succeeds.
  - `grep_tools`: `doclea_context` tool call fails/unavailable (expected).
- Confirmed `projectPath` propagation into Codex execution:
  - `grep_tools` run cited `package.json:13` from `/home/pho7on/tin/monorepo/main`.
- Updated adapter behavior: Doclea MCP server `cwd` now defaults to `metadata.projectPath` in isolated mode, so `doclea_context` uses the target repo config/index by default.
- Validated vector store side:
  - Qdrant collection `doclea-memories-qwen` exists and is configured with vector size `1024`.
- Prior embedding blocker (now resolved):
  - Existing TEI on `:8080` is `BAAI/bge-base-en-v1.5` and returns `768` dimensions.
  - Attempted to bring up Qwen embedding service on `:8180` using `ghcr.io/huggingface/text-embeddings-inference:latest` + `Qwen/Qwen3-Embedding-0.6B`.
  - Container starts but `/health` and `/embed` are not reachable during prolonged warmup on CPU.
- Added strict benchmark preflight to prevent fallback to old stack.
- Prepared fixture path for benchmark command:
  - `/tmp/tin-monorepo-choice-realworld.fast.json` (5 queries)
- Embedding blocker resolved after Qwen-specific TEI tuning:
  - Started `mcp-embeddings-qwen-8180` with:
    - `--model-id Qwen/Qwen3-Embedding-0.6B`
    - `--auto-truncate`
    - `--max-batch-tokens 2048`
    - `--max-client-batch-size 8`
    - `--tokenization-workers 8`
  - Verified `curl http://localhost:8180/health` returns HTTP 200.
  - Verified `/embed` output dimension is `1024`.
- Added benchmark-level Qwen preflight gate (`scripts/mcp-vs-grep-choice-benchmark.ts`):
  - Validates embedding endpoint, embedding dimension, Qdrant URL, collection name, and vector size before run.
  - Fails fast if stack drifts back to old `:8080`/non-Qwen settings.

## Live Test Result (2026-02-16, Smoke-1)

- Executed strict Qwen measured smoke run (1 query, 2 modes) with:
  - Fixture: `/tmp/tin-monorepo-choice-realworld.smoke1.json`
  - Report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.smoke1.json`
- Preflight passed in-report:
  - `qwenPreflight.enabled = true`
  - `embeddingDimension = 1024`
  - `qdrantCollection = doclea-memories-qwen`
  - `qdrantVectorSize = 1024`
- Measured timing (single run each mode):
  - `mcp_full` model request: `120450 ms` (~120.5s)
  - `grep_tools` model request: `200994 ms` (~201.0s)
- Smoke quality for this single query is currently zero (`matchedFileCount=0` in both modes), with run details showing missing expected files for this prompt.
- A prior attempt to run the full 5-query fixture was stopped due to runtime; use the smoke result only as pipeline validation, not as a final quality conclusion.
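
For reference, a minimal sketch of how file-level quality (`matchedFileCount`, recall, precision) could be derived from an arm's `cited_files` and a fixture's expected paths. The function and field names here are illustrative assumptions, not the actual code in `scripts/mcp-vs-grep-choice-benchmark.ts`. The `:line` suffix stripping is also an assumption, motivated by citations such as `package.json:13`.

```typescript
// Hypothetical result shape; field names mirror the report keys quoted in these notes.
interface FileScore {
  matchedFileCount: number;
  fileRecall: number;
  filePrecision: number;
}

// Assumption: citations may carry a leading "./" or a trailing ":<line>" suffix.
function normalizePath(p: string): string {
  return p.trim().replace(/^\.\//, "").replace(/:\d+$/, "");
}

// Drop duplicates while keeping first-seen order.
function unique(values: string[]): string[] {
  return values.filter((v, i, arr) => arr.indexOf(v) === i);
}

function scoreCitedFiles(citedFiles: string[], expectedFiles: string[]): FileScore {
  const cited = unique(citedFiles.map(normalizePath));
  const expected = unique(expectedFiles.map(normalizePath));
  const matched = cited.filter((file) => expected.indexOf(file) !== -1).length;
  return {
    matchedFileCount: matched,
    // Recall: share of expected files that were cited.
    fileRecall: expected.length === 0 ? 0 : matched / expected.length,
    // Precision: share of cited files that were expected.
    filePrecision: cited.length === 0 ? 0 : matched / cited.length,
  };
}
```

Under this definition, a fixture whose expected paths do not exist in the target repo yields `matchedFileCount=0` and zero recall and precision for every arm, regardless of answer quality.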

## Live Test Result (2026-02-16, Full 5-Query Run)

- Completed full strict-Qwen measured run with:
  - Fixture: `/tmp/tin-monorepo-choice-realworld.fast.json`
  - Report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.fast.json`
  - `DOCLEA_LIVE_LLM_TIMEOUT_MS=420000` (240000 timed out on one long call)
- Report-level validation:
  - `timingSource = measured`
  - `realworldCodex = true`
  - `tokenAccounting.inputTokensSource = payload`
  - Qwen preflight in report is valid (`1024` embedding/vector size, Qdrant collection `doclea-memories-qwen`)
- Measured timing summary (warm, runs=5 each):
  - `mcp_full`:
    - `modelRequestAvg = 188646 ms`
    - `modelRequestP95 = 241920 ms`
  - `grep_tools`:
    - `modelRequestAvg = 230639 ms`
    - `modelRequestP95 = 407502 ms`
- Quality caveat (important):
  - `fileRecallAvg=0`, `filePrecisionAvg=0`, `wrongPathRatioAvg=1` for both modes.
  - Root cause: fixture expected file paths are not present in `/home/pho7on/tin/monorepo/main` (e.g. `scripts/mcp-vs-grep-choice-benchmark.ts`, `src/config.ts`), so quality scoring is currently invalid for this target repo.
  - Treat this run as timing/infra validation only until fixture expectations are aligned to files that actually exist in the benchmark target repo.
- Regenerated six-part artifacts from this full report:
  - JSON: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.json`
  - HTML: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.dark.html`
  - Generation summary: `queryCount=5`, `driftQueries=8`

## Correction (2026-02-16, Validated)

- Confirmed root cause of flat non-drift metrics:
  - Invalid fixture for target repo was used (`/tmp/tin-monorepo-choice-realworld.fast.json` copied from `documentation/retrieval/live-choice-queries.realworld.json`, which references files from `doclea/mcp`, not `tin/monorepo/main`).
- Correct fixture for `tin/monorepo/main`:
  - `/home/pho7on/tin/monorepo/main/.doclea/retrieval-agent-choice-queries.monorepo.json`
- Corrected 1-query smoke run (using monorepo fixture subset):
  - Report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.corrected-smoke1.json`
  - `mcp_full`: `fileRecall=1.0`, `filePrecision=0.55`, `modelRequestMs=196780`
  - `grep_tools`: `fileRecall=0.8182`, `filePrecision=0.45`, `modelRequestMs=244916`
- This confirms quality is not intrinsically flat; prior flat result came from fixture-target mismatch.

## Remaining Work (Tomorrow)

### 1) Quick sanity checks

- [x] Verify adapter behavior for both modes:
  - [x] `mcp_full` uses doclea MCP
  - [x] `grep_tools` runs without doclea MCP
- [x] Confirm Codex runs in target repo path via metadata `projectPath`.

### 2) Decide embedding/vector stack for final run

- Preferred (as requested): Qwen-compatible stack
  - [x] Bring up embedding service on `:8180`
  - [x] Verify embedding dimension is `1024`
  - [x] Verify Qdrant collection `doclea-memories-qwen` is `1024`
- If stack is not ready, stop and fix infra first (do not run final benchmark on fallback stack).
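
The stack checks above can be sketched as a fail-fast gate. This is a minimal illustration, assuming the observed values have already been collected (for example from the `/embed` response length and the Qdrant collection config). The interfaces and function names are hypothetical and not the actual preflight code in `scripts/mcp-vs-grep-choice-benchmark.ts`.

```typescript
// Hypothetical snapshot of what was observed on the running stack.
interface StackSnapshot {
  embedEndpoint: string;
  embeddingDimension: number;
  qdrantUrl: string;
  qdrantCollection: string;
  qdrantVectorSize: number;
}

// Hypothetical expectation, e.g. built from the DOCLEA_CHOICE_QWEN_* env values.
interface StackExpectation {
  embedEndpoint: string;
  qdrantUrl: string;
  qdrantCollection: string;
  vectorSize: number;
}

// Returns human-readable problems; an empty list means the gate passes.
function checkQwenStack(observed: StackSnapshot, expected: StackExpectation): string[] {
  const problems: string[] = [];
  if (observed.embedEndpoint !== expected.embedEndpoint) {
    problems.push("embedding endpoint is " + observed.embedEndpoint + ", expected " + expected.embedEndpoint);
  }
  if (observed.embeddingDimension !== expected.vectorSize) {
    problems.push("embedding dimension is " + observed.embeddingDimension + ", expected " + expected.vectorSize);
  }
  if (observed.qdrantUrl !== expected.qdrantUrl) {
    problems.push("Qdrant URL is " + observed.qdrantUrl + ", expected " + expected.qdrantUrl);
  }
  if (observed.qdrantCollection !== expected.qdrantCollection) {
    problems.push("Qdrant collection is " + observed.qdrantCollection + ", expected " + expected.qdrantCollection);
  }
  if (observed.qdrantVectorSize !== expected.vectorSize) {
    problems.push("Qdrant vector size is " + observed.qdrantVectorSize + ", expected " + expected.vectorSize);
  }
  return problems;
}

// Fail fast: throw before any benchmark query is issued.
function assertQwenStack(observed: StackSnapshot, expected: StackExpectation): void {
  const problems = checkQwenStack(observed, expected);
  if (problems.length > 0) {
    throw new Error("Qwen preflight failed:\n- " + problems.join("\n- "));
  }
}
```

With the values from these notes, a snapshot pointing at `:8080` with `768` dimensions would be rejected on both the endpoint and the dimension check, which is the drift the gate is meant to catch.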

### 2.1) Infra unblock commands (next)

Run one of the following before step 3:

```bash
# Inspect current Qwen TEI startup state
docker logs --tail 200 mcp-embeddings-qwen-8180
curl --max-time 2 -sS http://localhost:8180/health
```

```bash
# One-command Qwen CPU launcher + dimension validator
./scripts/run-qwen-cpu-embeddings.sh
```

```bash
# Compose profile for Qwen CPU stack (no BGE endpoint)
docker compose -f docker-compose.qwen-cpu.yml up -d
```

```bash
# Recreate Qwen TEI with tuned settings used in live run
docker rm -f mcp-embeddings-qwen-8180
docker run -d --name mcp-embeddings-qwen-8180 -p 8180:80 \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e HF_ENDPOINT=https://huggingface.co \
  -v embeddings_qwen_cache:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id Qwen/Qwen3-Embedding-0.6B \
  --auto-truncate \
  --max-batch-tokens 2048 \
  --max-client-batch-size 8 \
  --tokenization-workers 8 \
  --port 80
```

### 3) Run real-world measured choice benchmark (A/B)

Use this exact command pattern:

```bash
DOCLEA_BENCH_PROJECT_PATH=/home/pho7on/tin/monorepo/main \
DOCLEA_CHOICE_FIXTURE_PATH=/home/pho7on/tin/monorepo/main/.doclea/retrieval-agent-choice-queries.monorepo.json \
DOCLEA_CHOICE_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.fast.json \
DOCLEA_CHOICE_TIMING_MODE=measured \
DOCLEA_CHOICE_REALWORLD_CODEX=true \
DOCLEA_CHOICE_REQUIRE_QWEN_STACK=true \
DOCLEA_CHOICE_QWEN_EMBED_ENDPOINT=http://localhost:8180 \
DOCLEA_CHOICE_QWEN_QDRANT_URL=http://localhost:6333 \
DOCLEA_CHOICE_QWEN_COLLECTION=doclea-memories-qwen \
DOCLEA_CHOICE_QWEN_VECTOR_SIZE=1024 \
DOCLEA_CHOICE_INPUT_TOKEN_SOURCE=payload \
DOCLEA_CHOICE_CONCURRENCY=1 \
DOCLEA_CHOICE_RUNS_PER_QUERY=1 \
DOCLEA_CHOICE_WARMUP_RUNS=0 \
DOCLEA_CHOICE_RUN_KINDS=warm \
DOCLEA_CHOICE_MODES=mcp_full,grep_tools \
DOCLEA_CHOICE_TOKEN_BUDGET=3500 \
DOCLEA_LOCAL_EMBED_PROFILE=qwen_cpu \
DOCLEA_LOCAL_EMBED_MAX_BATCH_SIZE=8 \
DOCLEA_LOCAL_EMBED_TIMEOUT_MS=180000 \
DOCLEA_LIVE_LLM_CLI_COMMAND='DOCLEA_CODEX_USE_ISOLATED_CONFIG=true bun run scripts/llm-cli-codex-adapter.ts' \
DOCLEA_LIVE_LLM_MODEL=gpt-5.3-codex \
DOCLEA_LIVE_LLM_TEMPERATURE=0 \
DOCLEA_LIVE_LLM_MAX_OUTPUT_TOKENS=220 \
DOCLEA_LIVE_LLM_TIMEOUT_MS=420000 \
bun run scripts/mcp-vs-grep-choice-benchmark.ts
```

Optional quick validation command (already executed once in this session):

```bash
DOCLEA_BENCH_PROJECT_PATH=/home/pho7on/tin/monorepo/main \
DOCLEA_CHOICE_FIXTURE_PATH=/tmp/tin-monorepo-choice-corrected.smoke1.json \
DOCLEA_CHOICE_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.smoke1.json \
DOCLEA_CHOICE_TIMING_MODE=measured \
DOCLEA_CHOICE_REALWORLD_CODEX=true \
DOCLEA_CHOICE_REQUIRE_QWEN_STACK=true \
DOCLEA_CHOICE_QWEN_EMBED_ENDPOINT=http://localhost:8180 \
DOCLEA_CHOICE_QWEN_QDRANT_URL=http://localhost:6333 \
DOCLEA_CHOICE_QWEN_COLLECTION=doclea-memories-qwen \
DOCLEA_CHOICE_QWEN_VECTOR_SIZE=1024 \
DOCLEA_CHOICE_INPUT_TOKEN_SOURCE=payload \
DOCLEA_CHOICE_CONCURRENCY=1 \
DOCLEA_CHOICE_RUNS_PER_QUERY=1 \
DOCLEA_CHOICE_WARMUP_RUNS=0 \
DOCLEA_CHOICE_RUN_KINDS=warm \
DOCLEA_CHOICE_MODES=mcp_full,grep_tools \
DOCLEA_CHOICE_TOKEN_BUDGET=3500 \
DOCLEA_LOCAL_EMBED_PROFILE=qwen_cpu \
DOCLEA_LOCAL_EMBED_MAX_BATCH_SIZE=8 \
DOCLEA_LOCAL_EMBED_TIMEOUT_MS=180000 \
DOCLEA_LIVE_LLM_CLI_COMMAND='DOCLEA_CODEX_USE_ISOLATED_CONFIG=true bun run scripts/llm-cli-codex-adapter.ts' \
DOCLEA_LIVE_LLM_MODEL=gpt-5.3-codex \
DOCLEA_LIVE_LLM_TEMPERATURE=0 \
DOCLEA_LIVE_LLM_MAX_OUTPUT_TOKENS=220 \
DOCLEA_LIVE_LLM_TIMEOUT_MS=420000 \
bun run scripts/mcp-vs-grep-choice-benchmark.ts
```

### 4) Regenerate six-part report + HTML

```bash
DOCLEA_BENCH_PROJECT_PATH=/home/pho7on/tin/monorepo/main \
DOCLEA_SIX_SOURCE_REPORTS='3500:/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.fast.json' \
DOCLEA_SIX_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.json \
bun run scripts/mcp-six-part-benchmark.ts

DOCLEA_SIX_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.json \
DOCLEA_SIX_REPORT_HTML_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.dark.html \
bun run scripts/mcp-six-part-presentation-html.ts
```

### 5) Validate report realism

- [x] Confirm report `timingSource = measured`.
- [x] Confirm `realworldCodex = true` in choice report.
- [x] Confirm token accounting uses payload tokens for comparisons.
- [x] Check that `mcp_full` vs `grep_tools` timing numbers are plausible.
- [ ] Spot-check at least one query output and cited files per arm.
  - Blocked for quality interpretation until fixture `expectedFilePaths` match files that exist in `/home/pho7on/tin/monorepo/main`.

## Modified Files to Continue From

- `scripts/mcp-vs-grep-choice-benchmark.ts`
- `scripts/llm-cli-codex-adapter.ts`
- `scripts/lib/llm-cli-runner.ts`
- `src/index.ts`

## Notes

- Do not conclude anything from prior `1.4M vs 45k` as a fairness result.
- Final conclusion must use the real-world Codex A/B run above.
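
Before drawing that final conclusion, the win criteria listed under "Focus To Beat Pure LLM" can be checked mechanically. The sketch below is one possible reading, with these assumptions: "points" means percentage points on a 0-1 metric (so `5` points is `0.05`), "+ `10%`" means at most 1.10 times the `grep_tools` time, and `2:09.5` style values are minutes:seconds. All names are hypothetical.

```typescript
// Hypothetical per-arm summary; recall/precision are fractions in [0, 1].
interface ArmMetrics {
  recall: number;
  precision: number;
  endToEndSeconds: number;
}

interface GateResult {
  recallOk: boolean;
  precisionOk: boolean;
  latencyOk: boolean;
  pass: boolean;
}

// Assumed thresholds, taken from the win criteria at token budget 3500.
const REQUIRED_POINT_LEAD = 0.05; // 5 percentage points
const MAX_LATENCY_RATIO = 1.1; // grep_tools + 10%
const EPSILON = 1e-9; // tolerance for floating-point comparisons

function evaluateWinCriteria(mcp: ArmMetrics, grep: ArmMetrics): GateResult {
  const recallOk = mcp.recall + EPSILON >= grep.recall + REQUIRED_POINT_LEAD;
  const precisionOk = mcp.precision + EPSILON >= grep.precision + REQUIRED_POINT_LEAD;
  const latencyOk = mcp.endToEndSeconds <= grep.endToEndSeconds * MAX_LATENCY_RATIO + EPSILON;
  return { recallOk, precisionOk, latencyOk, pass: recallOk && precisionOk && latencyOk };
}

// Baseline figures from "Latest Authoritative Result" (2:09.5 = 129.5 s, 2:52.0 = 172.0 s).
const baselineMcpFull: ArmMetrics = { recall: 0.7736, precision: 0.6967, endToEndSeconds: 129.5 };
const baselineGrepTools: ArmMetrics = { recall: 0.7918, precision: 0.74, endToEndSeconds: 172.0 };
const baselineGate = evaluateWinCriteria(baselineMcpFull, baselineGrepTools);
```

Under these assumptions the current baseline meets only the latency criterion, since `mcp_full` trails on both recall and precision, so the gate does not pass yet.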
