ophelia-execution-verifier

by CodedVibesX

Independent reference implementation, built for my founding-engineer application to Ophelia. Not affiliated with, endorsed by, or an official product of Ophelia.

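Headline results: verification cuts the silent-failure rate from 16.6% to 2.6% (mean across 10 seeds). Raw API success is 91.0% against a true outcome success of 65.6%. The recovery dial ranges from aggressive (92.0% booked, 4.1% silent) to conservative (61.3% booked, 2.6% silent).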

A Claude-connector-spec MCP server that books a restaurant, then proves whether the booking actually happened, recovers what it can, and escalates what it can't. It exists to measure and close the gap between "the provider's API said success" and "the reservation is real."

The problem

An AI agent can plan a booking perfectly and still fail the user, because the real world does not behave like a clean API. Providers return "confirmed" when nothing was reserved. A held slot never finalizes. A page changes shape and the agent reads a phantom success. The hard part of real-world execution is not deciding to book the table. It is knowing, afterward, whether the table is actually booked.

The most dangerous version is the silent failure: the API claims success, the agent believes it, the user gets nothing, and nobody finds out until they are standing at the host stand. You cannot prevent every real-world edge case up front, so the job is to make failures bounded, observable, and recoverable instead of silent.

This repo quantifies that gap and ships a layer that closes most of it.

The result

200 tasks per run, 10 seeds, fully offline and deterministic.

The provider's API looks healthy. The bookings often are not:

| metric | rate |
| --- | --- |
| raw API "success" | 91.0% |
| true outcome success | 65.6% |

That 25.4-point spread is the silent-failure surface: the bookings a naive agent reports as done that never actually happened. The verification layer collapses it:

| silent-failure rate | rate (10-seed mean) |
| --- | --- |
| before verification | 16.6% |
| after verification | 2.6% |

The after-verification rate is below the before-verification rate in all 10 seeds. McNemar exact paired test on the per-task silent-failure indicator: p well under 1e-7.
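For readers unfamiliar with the test, here is a minimal sketch of an exact McNemar calculation, not the repo's own code. It assumes each task has been reduced to a before/after silent-failure pair and that only the two discordant counts matter; the function name `mcnemar_exact` and its arguments are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks that silently failed before verification but not after
    c: tasks that silently failed after verification but not before
    Concordant pairs (same outcome both times) carry no information here.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the test pairs each task with itself, it asks whether the tasks that changed status changed mostly in one direction, which is a stricter question than whether two independent percentages differ.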

The residual 2.6% is honest. It is the "deep ghost" slice, where every independent signal corroborates a booking that does not exist. No re-read can catch that, and a verification layer that claimed a perfect 0% would be the tell that it is mocked. Confidence scores are measured against ground truth with a calibration curve, not asserted.
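Measuring confidence against ground truth can be sketched as a binned calibration curve. This is an illustration under stated assumptions rather than the repo's implementation: the function names, the equal-width binning, and the default bin count are all hypothetical.

```python
def calibration_curve(confidences, outcomes, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns a list of (mean confidence, observed accuracy, count) per
    non-empty bin. A well-calibrated verifier has the first two close.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, real in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, real))
    curve = []
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, r in bucket if r) / len(bucket)
        curve.append((mean_conf, accuracy, len(bucket)))
    return curve


def expected_calibration_error(curve):
    """Count-weighted mean gap between stated confidence and observed accuracy."""
    total = sum(n for _, _, n in curve)
    return sum(n * abs(conf - acc) for conf, acc, n in curve) / total
```

Here `outcomes` is the ground-truth flag for whether the booking really exists, which the simulated provider can supply and a live provider generally cannot.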

The recovery dial

Verification tells you the truth. Recovery decides what to do about it, and that is a product decision, not a default. Retrying a failed booking buys real reservations but re-exposes you to deep ghosts, so every retry trades a little silent-failure risk for a lot of booking success. This repo exposes that as a dial:

| policy | real bookings | silent failures | human escalations / run |
| --- | --- | --- | --- |
| aggressive | 92.0% | 4.1% | 3.3 |
| conservative | 61.3% | 2.6% | 42.7 |

  • Aggressive: retry hard, maximize real bookings, accept the silent-failure rise. For a $20 dinner reservation.

  • Conservative: never gamble a retry, hold silent failures at the verification floor, escalate the uncertain ones to a human. For a $500 concert.

The insight the data forced: the only real lever on the silent-failure rate is how hard you retry. So that is the knob, set by the stakes of the action.
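One way to picture the dial is as a retry budget attached to a named policy. The sketch below is hypothetical: `Policy`, `Action`, `next_action`, and the specific budgets are illustrative names and values, chosen only to match the descriptions above (conservative never retries; aggressive retries up to the three-attempt cap).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    RETRY = "retry"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Policy:
    name: str
    max_retries: int  # 0 means never gamble a retry


# Assumed budgets: two retries gives three attempts in total.
AGGRESSIVE = Policy("aggressive", max_retries=2)
CONSERVATIVE = Policy("conservative", max_retries=0)


def next_action(verified: bool, attempts_made: int, policy: Policy) -> Action:
    """Decide what to do after an attempt has been made and verified."""
    if verified:
        return Action.ACCEPT
    retries_used = attempts_made - 1
    if retries_used < policy.max_retries:
        return Action.RETRY
    return Action.ESCALATE  # out of budget: hand the task to a human
```

Under this framing, a caller picks the policy per action according to its stakes, and the escalation count in the table above falls out of how often the budget runs dry.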

How it works

A dual-layer design that mirrors a real execution platform: a core engine plus a thin protocol surface on top.

  • A thin FastMCP server (Streamable HTTP, ready to run as a custom Claude connector) exposes a two-phase, consent-gated booking tool: propose_booking returns a pending action, confirm_booking executes it. The write tool carries destructiveHint so the client shows its own approval prompt before anything happens.

  • A simulated provider injects five realistic outcome modes (clean, plus four failure modes: confirmed-but-not-real, partial, transient, drift) through fallible read channels, so verification has to work against noise, not a clean oracle.

  • The verification layer reconciles each claimed outcome against independent re-reads (run concurrently) and returns a verified status, a calibrated confidence, and the supporting evidence.

  • The recovery orchestrator applies the stakes policy: bounded retries with backoff (capped at three attempts), provider reliability scoring, and human escalation.
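The verification step described above, reconciling a claimed booking against concurrent independent re-reads, might look roughly like this. It is a sketch under assumptions: `Verdict`, `verify`, and the channel interface are hypothetical, and the confidence here is a raw agreement fraction standing in for the calibrated score the real layer reports.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    status: str        # "verified", "not_found", or "uncertain"
    confidence: float  # agreement fraction, a placeholder for a calibrated score
    evidence: dict     # channel name -> True / False / None (no signal)


async def verify(booking_id, channels):
    """Run every re-read channel concurrently and reconcile the answers.

    channels: mapping of name -> async callable(booking_id) -> bool
    """
    names = list(channels)
    results = await asyncio.gather(
        *(channels[name](booking_id) for name in names),
        return_exceptions=True,  # one failing channel must not sink the rest
    )
    evidence = {}
    for name, result in zip(names, results):
        # A channel that raised gives no signal; it is not a vote either way.
        evidence[name] = None if isinstance(result, BaseException) else bool(result)
    votes = [v for v in evidence.values() if v is not None]
    if not votes:
        return Verdict("uncertain", 0.0, evidence)
    seen = sum(votes) / len(votes)
    if seen == 1.0:
        status = "verified"
    elif seen == 0.0:
        status = "not_found"
    else:
        status = "uncertain"
    return Verdict(status, max(seen, 1.0 - seen), evidence)
```

Calling `asyncio.run(verify("booking-1", {"page": read_page, "email": read_email}))` with two hypothetical channel coroutines returns a single `Verdict`. Note that unanimous channels still yield "verified" for a deep ghost, which is exactly the residual the results section describes.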

Reproducibility

Everything runs offline against the simulated provider. No real provider, no API keys, ever. Clone it, run the eval, get the same numbers. The headline rates are pinned by tests that fail if the reported table and the underlying contingency math ever drift apart, so the numbers cannot quietly rot.
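A pinning test of this kind can be sketched as a check that the reported rates are recomputable from the paired contingency counts. The helper names and the counts in the usage note are made up for illustration and are not the repo's data or test code.

```python
def silent_failure_rates(table):
    """Recompute before/after rates from paired counts.

    table maps (silent_before, silent_after) -> number of tasks.
    """
    n = sum(table.values())
    before = (table[(True, True)] + table[(True, False)]) / n
    after = (table[(True, True)] + table[(False, True)]) / n
    return before, after


def check_reported(table, reported_before, reported_after, tol=0.0005):
    """Fail loudly if the published rates drift from the underlying counts."""
    before, after = silent_failure_rates(table)
    assert abs(before - reported_before) <= tol, (before, reported_before)
    assert abs(after - reported_after) <= tol, (after, reported_after)
```

With toy counts such as `{(True, True): 2, (True, False): 8, (False, True): 1, (False, False): 89}`, `check_reported(table, 0.10, 0.03)` passes, and editing either the table or the reported figures alone makes it fail.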

Rigor

  • Paired significance: McNemar's exact test on the per-task silent-failure indicator, not a comparison of two unpaired percentages.

  • Wilson confidence intervals on every rate.

  • Calibration: confidence is measured against ground truth, not claimed.

  • Fallible verification on purpose: a perfect "after" column would be the signal that the harness is mocked.

  • Adversarial self-review: the layer and its metrics were run through a fleet of reviewers, each hunting a specific failure mode. All findings were fixed, and the final pass found zero measurement errors.
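For reference, the Wilson score interval named in the list above has a short closed form. The sketch below is a generic implementation of the standard formula, not code taken from the repo; the function name and the default `z` for a 95% interval are assumptions.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    Unlike the naive p +/- z*sqrt(p(1-p)/n) interval, it stays inside
    [0, 1] and behaves sensibly for rates near zero, such as a low
    silent-failure rate.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For a rate of a few percent on 200 tasks, the interval is noticeably asymmetric around the point estimate, which is the main reason to prefer it over the normal approximation here.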

What a founding engineer would own here in the first 30 to 60 days

  • Replace the simulated provider with real adapters (Resy, OpenTable, SevenRooms), where the verification channels become real re-reads, webhooks, and confirmation parsing.

  • Turn the reliability scores into a live provider-health system that routes around flaky providers per action type.

  • Make the dial a per-action policy surface that the platform's customers configure by stakes.

  • Build the system of record: persist every outcome with its evidence so the calling agent can reason about what to do next.

What is next, and deliberately not built yet

  • OAuth user-consent flow: the step that turns this from a paste-the-URL custom connector into a directory-grade one. The server is built to the connector spec; the consent flow is the next addition.

  • Observability surface: traces and a live dashboard.

Both were scoped out on purpose, to keep the core honest and tight rather than broad and shallow.

Run it

```shell
# install
pip install -e .

# run the tests
pytest

# run the full eval (200 tasks x 10 seeds, offline)
python -m evals.run_eval

# run the MCP server
python -m ophelia_verifier.server
```

About

Built as a work sample for Ophelia's founding AI/LLM engineer role. The goal was to build the exact kind of thing Ophelia builds, an execution layer an AI agent calls, and to take seriously the part that is actually hard: knowing whether the real-world action happened.
