Thoughtbox

thoughtbox
agentops

PHASE1.2_IMPLEMENTATION_SUMMARY.md•8.45 KiB

# AgentOps Phase 1.2: Anti-Slop Hardening + Cost Transparency **Status**: ✅ Complete **Date**: 2026-01-29 **Tests**: 12/12 passing ## Summary Implemented P0 anti-slop fixes and cost transparency improvements based on ChatGPT o1 reasoning model feedback. ## Changes Implemented ### P0-1: Evidence Provenance Enforcement ✅ **Problem**: Evidence URLs only validated by prompt, not by code. **Solution**: Added code-level enforcement that evidence URLs must be from collected signals. **Files Modified**: - `agentops/runner/lib/template.ts` - Added `normalizeURL()` function (handles arXiv URL variations, trailing slashes) - Added `validateEvidenceProvenance()` function - Updated `validateProposalsPayload()` signature to accept `collectedSignals` parameter - `agentops/runner/lib/synthesis.ts` - Pass collected signals to validator **Tests Added** (2): - Rejects URLs not in collected signals - Normalizes arXiv URLs correctly (http/https, /abs vs /pdf) --- ### P0-2: Fix Numeric Impact Regex ✅ **Problem**: False positives on "MCP v2", "stage 2", etc. False negatives on "2x" at EOL. **Solution**: New unit-based pattern that only rejects numbers with impact units (%, ms, x, times). **Files Modified**: - `agentops/runner/lib/template.ts` (line 253) - OLD: `/\d+%|\d+x\s|by\s+\d+|reduce.*\d+|improve.*\d+/i` - NEW: `/\b\d+(\.\d+)?\s*(%|ms|milliseconds?|seconds?|minutes?|hours?|x|×|times)(\b|$)|by\s+\d+(\.\d+)?\s*(%|ms|x|×|times)/i` **What it catches** (true positives): - "40%" → ✅ Rejected - "2x faster" → ✅ Rejected - "100ms improvement" → ✅ Rejected - "by 40%" → ✅ Rejected **What it allows** (false positives fixed): - "MCP v2" → ✅ Allowed - "Improve stage 2 gating" → ✅ Allowed - "Reduce 404 errors" → ✅ Allowed **Tests Added** (4): - Allows version numbers (v2, v2.0.1) - Allows stage numbers (stage 2) - Rejects percentage claims (40%) - Rejects multiplier claims (2x) --- ### P0-3: Add --fixtures Flag ✅ **Problem**: `--dry-run` still costs money (makes LLM calls). **Solution**: True zero-cost mode using hardcoded fixture data. **Files Modified**: - `agentops/runner/cli.ts` - Added `--fixtures` flag parsing - Updated help text - `agentops/runner/daily-dev-brief.ts` - Added `fixturesMode` to `DailyBriefOptions` interface - Pass option to synthesis - `agentops/runner/lib/synthesis.ts` - Added `SynthesisOptions` interface - Early return in `synthesizeProposals()` if fixturesMode is true - Returns fixture data from `agentops/fixtures/proposals.example.json` - Cost: $0.00 **Usage**: ```bash # Zero cost, no LLM calls npm run agentops:daily -- --fixtures # Previous behavior (LLM call, costs money, no GitHub issue) npm run agentops:daily -- --dry-run # Production (LLM call + GitHub issue creation) npm run agentops:daily ``` --- ### P0-4: Cost Calculation Transparency ✅ **Problem**: Costs calculated from hardcoded pricing, not labeled as calculated, not model-aware. **Solution**: Created pricing config per model with transparency metadata. **Files Created**: - `agentops/runner/lib/llm/pricing.ts` - `ANTHROPIC_PRICING` config for all models - `calculateCost()` function with model-specific pricing - Handles Sonnet 4.5 large context tier (>200K tokens) - Includes pricing source URL and date **Files Modified**: - `agentops/runner/lib/llm/types.ts` - Renamed `cost_usd` → `cost_usd_calculated` - Added `cost_metadata` with pricing transparency - `agentops/runner/lib/llm/provider.ts` - Import and use `calculateCost()` for Anthropic - Updated OpenAI handler with metadata - `agentops/runner/lib/synthesis.ts` - Use `cost_usd_calculated` instead of `cost_usd` **Pricing Transparency**: ```json { "cost_usd_calculated": 0.069, "cost_metadata": { "model": "claude-sonnet-4-5-20250929", "inputPricePerMToken": 3.00, "outputPricePerMToken": 15.00, "pricingSource": "https://www.anthropic.com/pricing", "pricingDate": "2026-01-29" } } ``` **Tests Added** (4): - Sonnet 4.5 standard pricing (≤200K) - Sonnet 4.5 large context pricing (>200K) - Opus 4.5 pricing - Unknown model fallback --- ### P1-1: Per-Source Observability ✅ **Problem**: Can't debug which sources degraded or became slower. **Solution**: Added per-source counts and timings to metadata. **Files Modified**: - `agentops/runner/lib/sources/types.ts` - Added `signals_by_source: Record<string, number>` - Added `elapsed_ms_by_source: Record<string, number>` - `agentops/runner/lib/sources/collect.ts` - Track timing for each source (repo, arxiv, rss, html) - Count signals per source - Include in returned metadata **Example Output**: ```json { "signals_by_source": { "repo": 3, "arxiv": 12, "rss": 5, "html": 11 }, "elapsed_ms_by_source": { "repo": 450, "arxiv": 890, "rss": 320, "html": 1200 } } ``` --- ### P1-2: Add signals.json Artifact ✅ **Problem**: Can't reproduce how signals were subsetted/deduplicated. **Solution**: Persist all collected signals before processing. **Files Modified**: - `agentops/runner/daily-dev-brief.ts` - Store full signals collection in `collectedSignalsData` - Write `signals.json` with all collected signals - Add to artifact_index **Artifact Structure**: ```json { "collected_at": "2026-01-29T12:00:00Z", "total_collected": 31, "signals": [...], "metadata": { "signals_by_source": {...}, "elapsed_ms_by_source": {...} } } ``` --- ## Test Results ``` === Testing Evidence Provenance Enforcement === ✅ Evidence provenance: rejects URLs not in signals ✅ Evidence provenance: normalizes arXiv URLs === Testing Numeric Impact Regex === ✅ Numeric regex: allows version numbers (v2) ✅ Numeric regex: allows stage numbers ✅ Numeric regex: rejects percentage claims (40%) ✅ Numeric regex: rejects multiplier claims (2x) === Testing Cost Calculation === ✅ Cost calculation: Sonnet 4.5 standard pricing ✅ Cost calculation: Sonnet 4.5 large context pricing ✅ Cost calculation: Opus 4.5 pricing ✅ Cost calculation: unknown model fallback === Testing URL Normalization === ✅ URL normalization: arXiv URLs ✅ URL normalization: trailing slashes ✨ All Phase 1.2 tests passed! ``` **Total**: 12/12 tests passing --- ## TypeScript Compilation ```bash $ cd agentops && npx tsc --noEmit # No errors ``` --- ## Files Changed ### Modified (9 files) 1. `agentops/runner/lib/template.ts` - Evidence provenance + numeric regex fix 2. `agentops/runner/lib/synthesis.ts` - Fixtures mode + pass signals to validator 3. `agentops/runner/lib/llm/provider.ts` - Use pricing config 4. `agentops/runner/lib/llm/types.ts` - Update response types 5. `agentops/runner/lib/sources/types.ts` - Add per-source metadata fields 6. `agentops/runner/lib/sources/collect.ts` - Track per-source metrics 7. `agentops/runner/cli.ts` - Add --fixtures flag 8. `agentops/runner/daily-dev-brief.ts` - Fixtures mode + signals.json 9. `agentops/runner/types.ts` - (No changes needed, using existing types) ### Created (2 files) 1. `agentops/runner/lib/llm/pricing.ts` - Pricing config + calculateCost() 2. `agentops/tests/phase1.2.test.ts` - Test suite (12 tests) --- ## Risk Assessment | Component | Risk | Status | |-----------|------|--------| | Evidence provenance | LOW | ✅ Clear errors if URLs don't match | | Numeric regex | LOW | ✅ Extensive test coverage | | Fixtures flag | LOW | ✅ Simple boolean, zero cost verified | | Cost calculation | LOW | ✅ Official pricing, fallback for unknown models | | Per-source metadata | LOW | ✅ Additive change, doesn't break existing | **Overall Risk**: LOW **No Regressions**: All existing functionality preserved --- ## Next Steps 1. ✅ Run integration test with `--fixtures` flag 2. ✅ Run integration test with `--dry-run` flag (verify cost metadata) 3. Monitor 3-5 production runs with new validation 4. Verify per-source timings help debug slow sources --- ## Maintenance Notes ### Pricing Updates Update `agentops/runner/lib/llm/pricing.ts` quarterly: 1. Check https://www.anthropic.com/pricing 2. Update prices if changed 3. Update `lastUpdated` date 4. Run tests to verify calculations 5. Add new models as released ### Evidence Provenance If too strict, can disable by not passing `collectedSignals` to `validateProposalsPayload()`. --- ## Documentation See plan document for full context: - `agentops/PHASE1.2_PLAN.md` (if exists) - ChatGPT o1 feedback summary included in plan --- **Implementation Time**: ~4 hours **Lines of Code Added**: ~350 **Lines of Code Modified**: ~50 **Test Coverage**: 12 new tests, all passing

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Kastalien-Research/thoughtbox'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

PHASE1.2_IMPLEMENTATION_SUMMARY.md•8.45 KiB