# Live Codex A/B Benchmark - Next Steps (2026-02-16)
## Goal
Run a **real-world** benchmark with:
- `Codex without MCP` (`grep_tools` arm)
- `Codex + Doclea MCP` (`mcp_full` arm)
on target repo: `/home/pho7on/tin/monorepo/main`
## What Was Completed Today
- Added payload-vs-provider token accounting in `scripts/mcp-vs-grep-choice-benchmark.ts`.
  - New env: `DOCLEA_CHOICE_INPUT_TOKEN_SOURCE=payload|provider` (default `payload`).
  - Report now includes token accounting metadata.
- Added real-world Codex mode in `scripts/mcp-vs-grep-choice-benchmark.ts`.
  - New env: `DOCLEA_CHOICE_REALWORLD_CODEX=true`.
  - In this mode, scoring uses files from Codex JSON output (`cited_files`) instead of precomputed retrieval context.
  - Supports only `mcp_full` and `grep_tools` in real-world mode.
- Extended metadata to carry `projectPath` into CLI request.
  - Updated `scripts/lib/llm-cli-runner.ts` schema to allow `metadata.projectPath`.
- Reworked `scripts/llm-cli-codex-adapter.ts` to support isolated Codex config per run:
  - `DOCLEA_CODEX_USE_ISOLATED_CONFIG=true`
  - Enables mode-based MCP setup (`mcp_full` => doclea MCP attached, `grep_tools` => no doclea MCP).
  - Passes `--cd <projectPath>` to Codex so it runs in target repo.
- Fixed MCP stdio corruption issue by redirecting `console.log` to `stderr` in `src/index.ts`.
  - This prevents Doclea MCP stdout logs from breaking protocol frames.
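
The stdout fix above can be sketched roughly as follows. This is a minimal illustration, not the actual change in `src/index.ts`: the helper name `redirectConsoleLogToStderr` and its injectable `sink` parameter are assumptions made for the example.

```typescript
// Hypothetical helper: keep console.log output off stdout so that stdout
// carries only MCP protocol frames when the server runs over stdio.
type LogSink = (...args: unknown[]) => void;

function redirectConsoleLogToStderr(
  // Default sink writes to stderr; a custom sink can be injected for tests.
  sink: LogSink = (...args) => console.error(...args),
): () => void {
  const originalLog = console.log;
  console.log = (...args: unknown[]): void => {
    sink(...args);
  };
  // Return a restore function so the patch can be undone.
  return () => {
    console.log = originalLog;
  };
}
```

Calling the helper once at process start, before any other module logs, would be the natural place for it in a stdio server.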
## Current Status
- Formatting checks pass for modified files.
- Codex adapter was validated successfully for `mcp_full` mode with doclea MCP attached.
- Full benchmark rerun has **not been completed yet**.
- One verification command for `grep_tools` arm was interrupted and should be rerun.
## Latest Authoritative Result (2026-02-16)
This is the latest completed measured run and should be treated as the current baseline.
- Run:
  - Model: `gpt-5-codex`
  - Reasoning: `medium`
  - Fixture: `/home/pho7on/tin/monorepo/main/.doclea/retrieval-agent-choice-queries.monorepo.json` (5 validated queries)
  - Choice report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.validsubset.full.gpt5-codex.medium.json`
  - Six-part report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.validsubset.full.gpt5-codex.medium.json`
  - Six-part HTML: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.validsubset.full.gpt5-codex.medium.dark.html`
- Result summary (`mcp_full` vs `grep_tools`):
  - Localization recall: `0.7736` vs `0.7918` (MCP behind by `1.82` points)
  - Localization precision: `0.6967` vs `0.7400` (MCP behind by `4.33` points)
  - Avg end-to-end per query: `2:09.5` vs `2:52.0` (MCP faster)
  - Modeled task time: `4:12.0` vs `4:41.3` (MCP faster)
  - Doc-drift triage (`mcp_hybrid_guardrail` vs `grep_tools`): MCP is strongly ahead (`+62.5` found-rate points, `-65.8%` tokens, `1.90x` faster)
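
To put a number on "MCP faster", the timing figures above can be converted to seconds and compared. The sketch below only does that arithmetic; the function names are illustrative and not part of the benchmark scripts.

```typescript
// Parse an "m:ss.s" duration string (the format used in the summary) into seconds.
function parseDurationSeconds(text: string): number {
  const [minutes, seconds] = text.split(":");
  return Number(minutes) * 60 + Number(seconds);
}

// A ratio above 1 means the first arm is faster than the second.
function speedup(fasterArm: string, slowerArm: string): number {
  return parseDurationSeconds(slowerArm) / parseDurationSeconds(fasterArm);
}

// mcp_full vs grep_tools, using the figures from the result summary.
const endToEndSpeedup = speedup("2:09.5", "2:52.0"); // 172.0 s / 129.5 s
const modeledSpeedup = speedup("4:12.0", "4:41.3"); // 281.3 s / 252.0 s
```

With these inputs the ratios come out to roughly `1.33x` for average end-to-end time and `1.12x` for modeled task time.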
## Focus To Beat Pure LLM (Priority)
Primary goal: make `mcp_full` win on localization quality while keeping the current speed advantage.
1. Fix query classes where MCP quality is trailing.
   - Main misses in latest run:
     - `q-outbox-transactional-dispatch` (precision gap)
     - `q-scoring-forwarding-pipeline` (recall + precision gap)
     - `q-worker-config-env-chain` (precision gap)
2. Add low-confidence fallback expansion in `mcp_full`.
   - If citation confidence is low or cross-scope coverage is weak, run targeted `rg` expansion before finalizing cited files.
3. Increase scope coverage guarantees in retrieval assembly.
   - Enforce at least one high-signal file per detected subsystem scope before final answer serialization.
4. Tighten citation post-filtering for precision.
   - Keep strict existence checks, but add relevance re-rank to drop weak extra files that hurt precision.
5. Tune `mcp_full` ranking weights on this fixture.
   - Run focused ablations on semantic vs lexical vs graph boosts using only the 5 validated queries and query-level deltas.
6. Gate on explicit win criteria before calling success.
   - Required at token budget `3500`:
     - `mcp_full` recall >= `grep_tools` + `5` points
     - `mcp_full` precision >= `grep_tools` + `5` points
     - `mcp_full` end-to-end <= `grep_tools` + `10%`
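
The win criteria in item 6 can be written down as a small gate function. This is a sketch under two assumptions: "points" means percentage points on the 0..1 metrics (consistent with the `1.82`-point recall gap reported above), and "+ `10%`" means `mcp_full` may be at most 10% slower than `grep_tools`. The type and function names are hypothetical.

```typescript
interface ArmMetrics {
  recall: number; // 0..1
  precision: number; // 0..1
  endToEndSeconds: number;
}

interface GateResult {
  recallPass: boolean;
  precisionPass: boolean;
  latencyPass: boolean;
  pass: boolean;
}

// Hypothetical gate: thresholds default to the criteria listed above.
function evaluateWinGate(
  mcp: ArmMetrics,
  grep: ArmMetrics,
  minPointGain = 5,
  maxSlowdown = 0.1,
): GateResult {
  const recallPass = (mcp.recall - grep.recall) * 100 >= minPointGain;
  const precisionPass = (mcp.precision - grep.precision) * 100 >= minPointGain;
  const latencyPass = mcp.endToEndSeconds <= grep.endToEndSeconds * (1 + maxSlowdown);
  return {
    recallPass,
    precisionPass,
    latencyPass,
    pass: recallPass && precisionPass && latencyPass,
  };
}

// Baseline figures from the latest authoritative result.
const baselineGate = evaluateWinGate(
  { recall: 0.7736, precision: 0.6967, endToEndSeconds: 129.5 },
  { recall: 0.7918, precision: 0.74, endToEndSeconds: 172.0 },
);
```

On the current baseline the latency criterion is already met, while both quality criteria fail, which matches the priority order of this section.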
## Live Continuation Update (2026-02-16)
- Re-ran sanity verification for adapter mode gating:
  - `mcp_full`: `doclea_context` tool call succeeds.
  - `grep_tools`: `doclea_context` tool call fails/unavailable (expected).
- Confirmed `projectPath` propagation into Codex execution:
  - `grep_tools` run cited `package.json:13` from `/home/pho7on/tin/monorepo/main`.
- Updated adapter behavior: Doclea MCP server `cwd` now defaults to `metadata.projectPath` in isolated mode, so `doclea_context` uses the target repo config/index by default.
- Validated vector store side:
  - Qdrant collection `doclea-memories-qwen` exists and is configured with vector size `1024`.
- Prior embedding blocker (now resolved):
  - Existing TEI on `:8080` is `BAAI/bge-base-en-v1.5` and returns `768` dimensions.
  - Attempted to bring up Qwen embedding service on `:8180` using `ghcr.io/huggingface/text-embeddings-inference:latest` + `Qwen/Qwen3-Embedding-0.6B`.
  - Container starts but `/health` and `/embed` are not reachable during prolonged warmup on CPU.
- Added strict benchmark preflight to prevent fallback to old stack.
- Prepared fixture path for benchmark command:
  - `/tmp/tin-monorepo-choice-realworld.fast.json` (5 queries)
- Embedding blocker resolved after Qwen-specific TEI tuning:
  - Started `mcp-embeddings-qwen-8180` with:
    - `--model-id Qwen/Qwen3-Embedding-0.6B`
    - `--auto-truncate`
    - `--max-batch-tokens 2048`
    - `--max-client-batch-size 8`
    - `--tokenization-workers 8`
  - Verified `curl http://localhost:8180/health` returns HTTP 200.
  - Verified `/embed` output dimension is `1024`.
- Added benchmark-level Qwen preflight gate (`scripts/mcp-vs-grep-choice-benchmark.ts`):
  - Validates embedding endpoint, embedding dimension, Qdrant URL, collection name, and vector size before run.
  - Fails fast if stack drifts back to old `:8080`/non-Qwen settings.
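
The comparison step of such a preflight gate might look like the sketch below. It is not the code in `scripts/mcp-vs-grep-choice-benchmark.ts`: the interfaces and the function name are assumptions, and the observed values are passed in rather than fetched, so the example stays free of network calls.

```typescript
// Hypothetical shapes; field names loosely follow the report fields
// (embeddingDimension, qdrantCollection, qdrantVectorSize).
interface ExpectedQwenStack {
  embedEndpoint: string;
  qdrantCollection: string;
  vectorSize: number;
}

interface ObservedStack {
  embedEndpoint: string;
  embeddingDimension: number;
  qdrantCollection: string;
  qdrantVectorSize: number;
}

// Returns a list of human-readable problems; an empty list means the gate passes.
function qwenPreflightProblems(expected: ExpectedQwenStack, observed: ObservedStack): string[] {
  const problems: string[] = [];
  if (observed.embedEndpoint !== expected.embedEndpoint) {
    problems.push(`embedding endpoint is ${observed.embedEndpoint}, expected ${expected.embedEndpoint}`);
  }
  if (observed.embeddingDimension !== expected.vectorSize) {
    problems.push(`embedding dimension is ${observed.embeddingDimension}, expected ${expected.vectorSize}`);
  }
  if (observed.qdrantCollection !== expected.qdrantCollection) {
    problems.push(`Qdrant collection is ${observed.qdrantCollection}, expected ${expected.qdrantCollection}`);
  }
  if (observed.qdrantVectorSize !== expected.vectorSize) {
    problems.push(`Qdrant vector size is ${observed.qdrantVectorSize}, expected ${expected.vectorSize}`);
  }
  return problems;
}
```

A caller would abort the benchmark whenever the returned list is non-empty, for example when the stack has drifted back to the `:8080` endpoint with `768`-dimensional embeddings.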
## Live Test Result (2026-02-16, Smoke-1)
- Executed strict Qwen measured smoke run (1 query, 2 modes) with:
  - Fixture: `/tmp/tin-monorepo-choice-realworld.smoke1.json`
  - Report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.smoke1.json`
- Preflight passed in-report:
  - `qwenPreflight.enabled = true`
  - `embeddingDimension = 1024`
  - `qdrantCollection = doclea-memories-qwen`
  - `qdrantVectorSize = 1024`
- Measured timing (single run each mode):
  - `mcp_full` model request: `120450 ms` (~120.5s)
  - `grep_tools` model request: `200994 ms` (~201.0s)
- Smoke quality for this single query is currently zero (`matchedFileCount=0` in both modes), with run details showing missing expected files for this prompt.
- A prior attempt to run the full 5-query fixture was stopped due to runtime; use the smoke result only as pipeline validation, not as a final quality conclusion.
## Live Test Result (2026-02-16, Full 5-Query Run)
- Completed full strict-Qwen measured run with:
  - Fixture: `/tmp/tin-monorepo-choice-realworld.fast.json`
  - Report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.fast.json`
  - `DOCLEA_LIVE_LLM_TIMEOUT_MS=420000` (a `240000` ms timeout was hit on one long call)
- Report-level validation:
  - `timingSource = measured`
  - `realworldCodex = true`
  - `tokenAccounting.inputTokensSource = payload`
  - Qwen preflight in report is valid (`1024` embedding/vector size, Qdrant collection `doclea-memories-qwen`)
- Measured timing summary (warm, runs=5 each):
  - `mcp_full`:
    - `modelRequestAvg = 188646 ms`
    - `modelRequestP95 = 241920 ms`
  - `grep_tools`:
    - `modelRequestAvg = 230639 ms`
    - `modelRequestP95 = 407502 ms`
- Quality caveat (important):
  - `fileRecallAvg=0`, `filePrecisionAvg=0`, `wrongPathRatioAvg=1` for both modes.
  - Root cause: fixture expected file paths are not present in `/home/pho7on/tin/monorepo/main` (e.g. `scripts/mcp-vs-grep-choice-benchmark.ts`, `src/config.ts`), so quality scoring is currently invalid for this target repo.
  - Treat this run as timing/infra validation only until fixture expectations are aligned to files that actually exist in the benchmark target repo.
- Regenerated six-part artifacts from this full report:
  - JSON: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.json`
  - HTML: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.dark.html`
  - Generation summary: `queryCount=5`, `driftQueries=8`
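
The fixture problem described in the quality caveat could be caught before a run with a check along these lines. This is a sketch only: the fixture shape (`id` plus `expectedFilePaths`) and the function name are assumptions, and the file-existence test is injected so the example does not depend on the local filesystem.

```typescript
// Assumed minimal fixture shape; real fixtures may carry more fields.
interface ChoiceQueryFixture {
  id: string;
  expectedFilePaths: string[];
}

// Returns, per query id, the expected paths that do not exist under projectPath.
function findMissingExpectedFiles(
  queries: ChoiceQueryFixture[],
  projectPath: string,
  fileExists: (absolutePath: string) => boolean,
): Record<string, string[]> {
  const root = projectPath.replace(/\/+$/, ""); // drop trailing slashes
  const missingByQuery: Record<string, string[]> = {};
  for (const query of queries) {
    const missing = query.expectedFilePaths.filter(
      (relativePath) => !fileExists(`${root}/${relativePath}`),
    );
    if (missing.length > 0) {
      missingByQuery[query.id] = missing;
    }
  }
  return missingByQuery;
}
```

In a real script, `fileExists` could simply be `existsSync` from `node:fs`, and a non-empty result would be a reason to stop before spending several minutes per query on a run whose quality scores cannot be interpreted.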
## Correction (2026-02-16, Validated)
- Confirmed root cause of flat non-drift metrics:
  - Invalid fixture for target repo was used (`/tmp/tin-monorepo-choice-realworld.fast.json` copied from `documentation/retrieval/live-choice-queries.realworld.json`, which references files from `doclea/mcp`, not `tin/monorepo/main`).
- Correct fixture for `tin/monorepo/main`:
  - `/home/pho7on/tin/monorepo/main/.doclea/retrieval-agent-choice-queries.monorepo.json`
- Corrected 1-query smoke run (using monorepo fixture subset):
  - Report: `/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.corrected-smoke1.json`
  - `mcp_full`: `fileRecall=1.0`, `filePrecision=0.55`, `modelRequestMs=196780`
  - `grep_tools`: `fileRecall=0.8182`, `filePrecision=0.45`, `modelRequestMs=244916`
- This confirms quality is not intrinsically flat; prior flat result came from fixture-target mismatch.
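
For reference, file-level recall and precision of the kind reported above can be computed from cited and expected paths as in the following sketch. It is illustrative rather than the benchmark's actual scoring code: the function names are hypothetical, and stripping a trailing `:<line>` suffix (as in `package.json:13`) is an assumption about how citations are normalized.

```typescript
interface LocalizationScore {
  matchedFileCount: number;
  fileRecall: number;
  filePrecision: number;
}

// Assumption: citations may carry a leading "./" or a trailing ":<line>" suffix.
function normalizeCitedPath(cited: string): string {
  return cited.trim().replace(/^\.\//, "").replace(/:\d+$/, "");
}

function scoreCitedFiles(citedFiles: string[], expectedFilePaths: string[]): LocalizationScore {
  const cited = new Set(citedFiles.map(normalizeCitedPath));
  const expected = new Set(expectedFilePaths.map(normalizeCitedPath));
  let matchedFileCount = 0;
  cited.forEach((path) => {
    if (expected.has(path)) matchedFileCount += 1;
  });
  return {
    matchedFileCount,
    // Recall: share of expected files that were cited.
    fileRecall: expected.size === 0 ? 0 : matchedFileCount / expected.size,
    // Precision: share of cited files that were expected.
    filePrecision: cited.size === 0 ? 0 : matchedFileCount / cited.size,
  };
}
```

When none of the expected paths exist in the target repo, no citation can match and both metrics collapse to `0` for every arm, which is the flat pattern the invalid fixture produced.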
## Remaining Work (Tomorrow)
### 1) Quick sanity checks
- [x] Verify adapter behavior for both modes:
  - [x] `mcp_full` uses doclea MCP
  - [x] `grep_tools` runs without doclea MCP
- [x] Confirm Codex runs in target repo path via metadata `projectPath`.
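
The mode gating verified in this step can be pictured as a small function that decides which MCP servers an isolated Codex config should contain. This is a hypothetical sketch, not the logic in `scripts/llm-cli-codex-adapter.ts`: the `McpServerSpec` shape, the function name, and the `doclea-mcp` command are placeholders.

```typescript
type BenchmarkMode = "mcp_full" | "grep_tools";

// Placeholder description of an MCP server entry.
interface McpServerSpec {
  command: string;
  args: string[];
  cwd: string;
}

function mcpServersForMode(mode: BenchmarkMode, projectPath: string): Record<string, McpServerSpec> {
  if (mode === "grep_tools") {
    // Baseline arm: no doclea MCP server is attached.
    return {};
  }
  return {
    doclea: {
      command: "doclea-mcp", // placeholder command, not the real launcher
      args: [],
      cwd: projectPath, // server runs against the target repo config/index
    },
  };
}
```

Keeping this decision in one place makes the A/B arms differ only in whether the `doclea` entry is present.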
### 2) Decide embedding/vector stack for final run
- Preferred (as requested): Qwen-compatible stack
  - [x] Bring up embedding service on `:8180`
  - [x] Verify embedding dimension is `1024`
  - [x] Verify Qdrant collection `doclea-memories-qwen` is `1024`
- If stack is not ready, stop and fix infra first (do not run final benchmark on fallback stack).
### 2.1) Infra unblock commands (next)
Run one of the following before step 3:
```bash
# Inspect current Qwen TEI startup state
docker logs --tail 200 mcp-embeddings-qwen-8180
curl --max-time 2 -sS http://localhost:8180/health
```
```bash
# One-command Qwen CPU launcher + dimension validator
./scripts/run-qwen-cpu-embeddings.sh
```
```bash
# Compose profile for Qwen CPU stack (no BGE endpoint)
docker compose -f docker-compose.qwen-cpu.yml up -d
```
```bash
# Recreate Qwen TEI with tuned settings used in live run
docker rm -f mcp-embeddings-qwen-8180
docker run -d --name mcp-embeddings-qwen-8180 -p 8180:80 \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e HF_ENDPOINT=https://huggingface.co \
  -v embeddings_qwen_cache:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id Qwen/Qwen3-Embedding-0.6B \
  --auto-truncate \
  --max-batch-tokens 2048 \
  --max-client-batch-size 8 \
  --tokenization-workers 8 \
  --port 80
```
### 3) Run real-world measured choice benchmark (A/B)
Use this exact command pattern:
```bash
DOCLEA_BENCH_PROJECT_PATH=/home/pho7on/tin/monorepo/main \
DOCLEA_CHOICE_FIXTURE_PATH=/home/pho7on/tin/monorepo/main/.doclea/retrieval-agent-choice-queries.monorepo.json \
DOCLEA_CHOICE_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.fast.json \
DOCLEA_CHOICE_TIMING_MODE=measured \
DOCLEA_CHOICE_REALWORLD_CODEX=true \
DOCLEA_CHOICE_REQUIRE_QWEN_STACK=true \
DOCLEA_CHOICE_QWEN_EMBED_ENDPOINT=http://localhost:8180 \
DOCLEA_CHOICE_QWEN_QDRANT_URL=http://localhost:6333 \
DOCLEA_CHOICE_QWEN_COLLECTION=doclea-memories-qwen \
DOCLEA_CHOICE_QWEN_VECTOR_SIZE=1024 \
DOCLEA_CHOICE_INPUT_TOKEN_SOURCE=payload \
DOCLEA_CHOICE_CONCURRENCY=1 \
DOCLEA_CHOICE_RUNS_PER_QUERY=1 \
DOCLEA_CHOICE_WARMUP_RUNS=0 \
DOCLEA_CHOICE_RUN_KINDS=warm \
DOCLEA_CHOICE_MODES=mcp_full,grep_tools \
DOCLEA_CHOICE_TOKEN_BUDGET=3500 \
DOCLEA_LOCAL_EMBED_PROFILE=qwen_cpu \
DOCLEA_LOCAL_EMBED_MAX_BATCH_SIZE=8 \
DOCLEA_LOCAL_EMBED_TIMEOUT_MS=180000 \
DOCLEA_LIVE_LLM_CLI_COMMAND='DOCLEA_CODEX_USE_ISOLATED_CONFIG=true bun run scripts/llm-cli-codex-adapter.ts' \
DOCLEA_LIVE_LLM_MODEL=gpt-5.3-codex \
DOCLEA_LIVE_LLM_TEMPERATURE=0 \
DOCLEA_LIVE_LLM_MAX_OUTPUT_TOKENS=220 \
DOCLEA_LIVE_LLM_TIMEOUT_MS=420000 \
bun run scripts/mcp-vs-grep-choice-benchmark.ts
```
Optional quick validation command (already executed once in this session):
```bash
DOCLEA_BENCH_PROJECT_PATH=/home/pho7on/tin/monorepo/main \
DOCLEA_CHOICE_FIXTURE_PATH=/tmp/tin-monorepo-choice-corrected.smoke1.json \
DOCLEA_CHOICE_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.smoke1.json \
DOCLEA_CHOICE_TIMING_MODE=measured \
DOCLEA_CHOICE_REALWORLD_CODEX=true \
DOCLEA_CHOICE_REQUIRE_QWEN_STACK=true \
DOCLEA_CHOICE_QWEN_EMBED_ENDPOINT=http://localhost:8180 \
DOCLEA_CHOICE_QWEN_QDRANT_URL=http://localhost:6333 \
DOCLEA_CHOICE_QWEN_COLLECTION=doclea-memories-qwen \
DOCLEA_CHOICE_QWEN_VECTOR_SIZE=1024 \
DOCLEA_CHOICE_INPUT_TOKEN_SOURCE=payload \
DOCLEA_CHOICE_CONCURRENCY=1 \
DOCLEA_CHOICE_RUNS_PER_QUERY=1 \
DOCLEA_CHOICE_WARMUP_RUNS=0 \
DOCLEA_CHOICE_RUN_KINDS=warm \
DOCLEA_CHOICE_MODES=mcp_full,grep_tools \
DOCLEA_CHOICE_TOKEN_BUDGET=3500 \
DOCLEA_LOCAL_EMBED_PROFILE=qwen_cpu \
DOCLEA_LOCAL_EMBED_MAX_BATCH_SIZE=8 \
DOCLEA_LOCAL_EMBED_TIMEOUT_MS=180000 \
DOCLEA_LIVE_LLM_CLI_COMMAND='DOCLEA_CODEX_USE_ISOLATED_CONFIG=true bun run scripts/llm-cli-codex-adapter.ts' \
DOCLEA_LIVE_LLM_MODEL=gpt-5.3-codex \
DOCLEA_LIVE_LLM_TEMPERATURE=0 \
DOCLEA_LIVE_LLM_MAX_OUTPUT_TOKENS=220 \
DOCLEA_LIVE_LLM_TIMEOUT_MS=420000 \
bun run scripts/mcp-vs-grep-choice-benchmark.ts
```
### 4) Regenerate six-part report + HTML
```bash
# Step 1: build the six-part JSON report from the choice report
DOCLEA_BENCH_PROJECT_PATH=/home/pho7on/tin/monorepo/main \
DOCLEA_SIX_SOURCE_REPORTS='3500:/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-vs-grep-choice-benchmark.measured.serial.realworld.fast.json' \
DOCLEA_SIX_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.json \
bun run scripts/mcp-six-part-benchmark.ts

# Step 2: render the HTML presentation from that JSON report
DOCLEA_SIX_REPORT_JSON_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.json \
DOCLEA_SIX_REPORT_HTML_PATH=/home/pho7on/tin/monorepo/main/.doclea/reports/mcp-six-part-benchmark.measured.serial.realworld.fast.dark.html \
bun run scripts/mcp-six-part-presentation-html.ts
```
### 5) Validate report realism
- [x] Confirm report `timingSource = measured`.
- [x] Confirm `realworldCodex = true` in choice report.
- [x] Confirm token accounting uses payload tokens for comparisons.
- [x] Check that `mcp_full` vs `grep_tools` timing numbers are plausible.
- [ ] Spot-check at least one query output and cited files per arm.
  - Blocked for quality interpretation until fixture `expectedFilePaths` match files that exist in `/home/pho7on/tin/monorepo/main`.
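
The first three checks in this list are mechanical and could be scripted. The sketch below assumes the choice report exposes `timingSource`, `realworldCodex`, and `tokenAccounting.inputTokensSource` at the top level, as the values quoted earlier suggest; the interface and function name are hypothetical.

```typescript
// Assumed subset of the choice report; all fields optional so that
// missing keys are reported instead of crashing the check.
interface ChoiceReportSummary {
  timingSource?: string;
  realworldCodex?: boolean;
  tokenAccounting?: { inputTokensSource?: string };
}

// Returns a list of problems; an empty list means the report passes these checks.
function reportRealismProblems(report: ChoiceReportSummary): string[] {
  const problems: string[] = [];
  if (report.timingSource !== "measured") {
    problems.push(`timingSource is ${String(report.timingSource)}, expected measured`);
  }
  if (report.realworldCodex !== true) {
    problems.push("realworldCodex is not true");
  }
  const tokenSource = report.tokenAccounting ? report.tokenAccounting.inputTokensSource : undefined;
  if (tokenSource !== "payload") {
    problems.push(`tokenAccounting.inputTokensSource is ${String(tokenSource)}, expected payload`);
  }
  return problems;
}
```

The plausibility and spot-check items still need a human look; this only guards against accidentally reading a modeled or non-real-world report.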
## Modified Files to Continue From
- `scripts/mcp-vs-grep-choice-benchmark.ts`
- `scripts/llm-cli-codex-adapter.ts`
- `scripts/lib/llm-cli-runner.ts`
- `src/index.ts`
## Notes
- Do not draw any fairness conclusion from the prior `1.4M vs 45k` comparison.
- Final conclusion must use the real-world Codex A/B run above.