Thoughtbox

thoughtbox
docs
solutions
integration-issues

unify-evaluation-subsystems-langsmith.md

unify-evaluation-subsystems-langsmith.md•7.87 KiB

--- title: Unified Evaluation System Foundation with LangSmith Integration date: 2026-02-11 category: integration-issues severity: high components: - src/evaluation/ - src/observatory/emitter.ts - src/index.ts - package.json - .specs/SPEC-EVAL-001-unified-evaluation-system.md - dgm-specs/implementation-status.json tags: - evaluation - langsmith - feedback-loop - trace-listener - alma - observatory - dgm-fitness related_issues: [] resolution_time: ~2 hours (spec + Phase 1 infrastructure) author: claude-opus-4-6 --- # Unifying Five Disconnected Evaluation Subsystems via LangSmith ## Problem The Thoughtbox project had five evaluation subsystems built independently with no integration layer: | Subsystem | What It Did | What It Didn't Do | |-----------|------------|-------------------| | **Observatory** | Emits events, displays dashboard | Events not persisted for analysis | | **Benchmark Harness** | Runs tests, produces scores | Scores not fed back to DGM archive | | **DGM Fitness** | Archive schema defined | All fitness scores hardcoded to 0.0 | | **`.eval/baselines.json`** | Template metrics defined | All `sample_size: 0` | | **EvaluationGatekeeper** | Type-safe gate interface | Pass-through stub (always passes) | **Symptom**: No closed feedback loop. ALMA-style memory design meta-learning impossible. **Evidence**: - `EvaluationGatekeeper.runTieredEvaluation()` contained: `"TieredEvaluator integration pending - using pass-through"` - `.eval/baselines.json` had every metric at `sample_size: 0` - DGM archive entries all showed `fitness: 0.0` ## Root Cause No integration layer existed between subsystems. Each was built to its own interface with no mechanism to: - Persist evaluation events for downstream analysis - Feed benchmark scores back to the DGM archive - Populate baseline metrics with real data - Gate deployments on actual evaluation results The subsystems were designed in isolation during separate development sessions without a unifying specification. ## Solution ### Approach: Spec-First, Then Build 1. Wrote `SPEC-EVAL-001` defining a 5-layer architecture before any code 2. Implemented Phase 1 (trace infrastructure) to validate the design 3. Preserved all existing code — added listeners, not replacements ### Architecture: 5-Layer Evaluation System ``` Agent Session (ThoughtEmitter events) | v Layer 1: LangSmithTraceListener (ThoughtEmitter -> LangSmith runs) | v Layer 2: Evaluation Datasets (ALMA collection/deployment task sets) | v Layer 3: Custom Evaluators (session quality, memory quality, DGM fitness) | v Layer 4: Experiment Runner (compare configs, feed winners to QD archive) | v Layer 5: Online Monitoring (regression detection, cost budgets, alerts) ``` ### Phase 1 Implementation (what was built) **`src/evaluation/types.ts`** (222 lines) — Shared types for all 5 layers: - `LangSmithConfig` — connection configuration - `SessionRun` — maps Thoughtbox sessions to LangSmith runs - `CollectionTask` / `DeploymentTask` — ALMA-style dataset types - `EvaluatorResult` — evaluator output contract - `ExperimentConfig` / `ExperimentResult` — experiment runner types - `MemoryDesignArchiveEntry` — DGM quality-diversity archive bridge - `MonitoringAlert` — Layer 5 alert types **`src/evaluation/langsmith-config.ts`** — Config loader: - Reads `LANGSMITH_API_KEY`, `LANGSMITH_ENDPOINT`, `LANGSMITH_PROJECT`, `LANGSMITH_WORKSPACE_ID` - Returns `null` if API key not set (graceful degradation) **`src/evaluation/trace-listener.ts`** — Core trace bridge: - `LangSmithTraceListener` subscribes to ThoughtEmitter singleton - Creates parent LangSmith run per session (`run_type: "chain"`) - Creates child runs per thought/revision/branch (`run_type: "tool"`) - All API calls fire-and-forget via `safeAsync()` wrapper - Idempotent attach/detach, 5-second delayed cleanup on session end **`src/observatory/emitter.ts`** — New event type: - Added `monitoring:alert` to `ThoughtEmitterEvents` union - Added `emitMonitoringAlert()` method following existing pattern **`src/index.ts`** — Server wiring: - `initEvaluation()` called in both `startHttpServer()` and `runStdioServer()` - Returns `null` (no-op) when `LANGSMITH_API_KEY` not set ## Key Technical Decisions ### 1. `CreateRunParams` vs `RunUpdate` Tags The LangSmith SDK's `CreateRunParams` does NOT include a `tags` field, while `RunUpdate` does. Solution: - **Run creation**: Tags stored in `extra.metadata.tags` - **Run updates**: Tags used directly via `RunUpdate.tags` ### 2. Fire-and-Forget Pattern All LangSmith API calls are non-blocking: ```typescript private safeAsync(fn: () => Promise<unknown>): void { fn().catch((err) => { console.warn("[Evaluation] LangSmith trace error:", err instanceof Error ? err.message : err); }); } ``` This ensures Observatory events are never delayed by network I/O and LangSmith outages don't crash the server. ### 3. Graceful Degradation ```typescript export function initEvaluation(): LangSmithTraceListener | null { const config = loadLangSmithConfig(); if (!config) { console.error("[Evaluation] LangSmith not configured. Tracing disabled."); return null; } // ... } ``` Without `LANGSMITH_API_KEY`, the system behaves identically to before this change. ## Prevention Strategies ### Avoid Building Disconnected Subsystems 1. **Spec-first for cross-cutting concerns** — Write the integration spec before any component. If the spec doesn't describe data flow between subsystems, the integration won't happen. 2. **Integration smoke tests** — Test the feedback loop as a closed system: - Emit session event -> verify trace in LangSmith - Run benchmark -> verify fitness score updates - Check gatekeeper -> verify it blocks/passes based on real data 3. **Red flags to watch for**: - "We'll integrate later" -> Will not happen - Subsystem doesn't mention how it reads from other subsystems -> Isolated - No integration test examples in the spec -> Integration is optional ### Future Risks | Risk | Mitigation | |------|-----------| | Phases 2-4 not built | SPEC-EVAL-001 has phased plan; track in `dgm-specs/implementation-status.json` | | LangSmith API breaking changes | Pin SDK version; add weekly compat test | | Trace emission performance overhead | Fire-and-forget pattern; benchmark before Phase 2 | ### Test Cases for Verification 1. **Trace sync**: Create session, verify LangSmith run exists with correct parent-child hierarchy 2. **Graceful degradation**: Unset `LANGSMITH_API_KEY`, verify server starts and operates normally 3. **Monitoring alerts**: Emit `monitoring:alert` event, verify listeners receive it 4. **Performance**: Measure overhead of trace listener attachment (<5% threshold) ## Related Documentation ### Specifications - [SPEC-EVAL-001](.specs/SPEC-EVAL-001-unified-evaluation-system.md) — Full 5-layer architecture - [SPEC-SIL-000](.specs/self-improvement-loop/SPEC-SIL-000-feedback-loop-validation.md) — Feedback loop validation - [SPEC-SIL-004](.specs/self-improvement-loop/SPEC-SIL-004-tiered-evaluator.md) — Tiered evaluator ### Implementation - `src/evaluation/` — New evaluation module (Phase 1) - `src/observatory/evaluation-gatekeeper.ts` — Gate interface (stub, to be wired in Phase 3) - `benchmarks/tiered-evaluator.ts` — Tiered evaluation pipeline - `.eval/baselines.json` — Baseline metrics (sample_size: 0, to be populated in Phase 4) ### Tracking - `dgm-specs/implementation-status.json` — EVAL-001 status: `in_progress` (Phase 1 complete) ## ALMA Mapping | ALMA Concept | Our Implementation | |---|---| | Memory Design Archive | `.dgm/` + LangSmith Experiments | | Performance Score | Custom evaluators (session + memory quality) | | Collection Phase | LangSmith dataset: tasks WITHOUT memory | | Deployment Phase | LangSmith dataset: tasks WITH memory | | Meta Agent reflection | LangSmith comparison view + eval log access |

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Kastalien-Research/thoughtbox'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

unify-evaluation-subsystems-langsmith.md•7.87 KiB