ophelia-execution-verifier
Independent reference implementation, built for my founding-engineer application to Ophelia. Not affiliated with, endorsed by, or an official product of Ophelia.

A Claude-connector-spec MCP server that books a restaurant, then proves whether the booking actually happened, recovers what it can, and escalates what it can't. It exists to measure and close the gap between "the provider's API said success" and "the reservation is real."
The problem
An AI agent can plan a booking perfectly and still fail the user, because the real world does not behave like a clean API. Providers return "confirmed" when nothing was reserved. A held slot never finalizes. A page changes shape and the agent reads a phantom success. The hard part of real-world execution is not deciding to book the table. It is knowing, afterward, whether the table is actually booked.
The most dangerous version is the silent failure: the API claims success, the agent believes it, the user gets nothing, and nobody finds out until they are standing at the host stand. You cannot prevent every real-world edge case up front, so the job is to make failures bounded, observable, and recoverable instead of silent.
This repo quantifies that gap and ships a layer that closes most of it.
The result
200 tasks per run, 10 seeds, fully offline and deterministic.
The provider's API looks healthy. The bookings often are not:
| metric | rate |
| --- | --- |
| raw API "success" | 91.0% |
| true outcome success | 65.6% |
That 25-point spread is the silent-failure surface: the bookings a naive agent reports as done that never actually happened. The verification layer collapses it:
| stage | silent-failure rate (10-seed mean) |
| --- | --- |
| before verification | 16.6% |
| after verification | 2.6% |
After is below before in 10 of 10 seeds. McNemar exact paired test on the per-task silent-failure indicator: p well under 1e-7.
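The paired test can be sketched with the standard library alone. McNemar's exact test ignores the tasks where "before" and "after" agree and looks only at the discordant pairs; under the null hypothesis, each discordant pair is equally likely to fall either way, so the count follows a Binomial(n, 0.5) distribution. The function name and argument names below are hypothetical and are not taken from the repo's eval code.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts.

    b: tasks that were silent failures before verification but not after
    c: tasks that were silent failures after verification but not before
    (Both names are illustrative assumptions.)
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Because the test is paired, it uses the per-task indicator for the same 200 tasks under both conditions, which is what makes it stricter than comparing two independent percentages.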
The residual 2.6% is honest. It is the "deep ghost" slice, where every independent signal corroborates a booking that does not exist. No re-read can catch that, and a verification layer that claimed a perfect 0% would be the tell that it is mocked. Confidence scores are measured against ground truth with a calibration curve, not asserted.
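Measuring confidence against ground truth usually means binning predictions and comparing each bin's mean confidence to the fraction of bookings in that bin that were actually real. The following is a minimal sketch of that idea; the function name, the equal-width binning, and the bin count are assumptions, not the repo's implementation.

```python
def calibration_curve(confidences, outcomes, n_bins=5):
    """Return (mean_confidence, observed_rate, count) for each non-empty bin.

    confidences: floats in [0, 1] reported by the verifier
    outcomes:    booleans, True when the booking was real per ground truth
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    curve = []
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        observed = sum(1 for _, ok in members if ok) / len(members)
        curve.append((mean_conf, observed, len(members)))
    return curve
```

A well-calibrated verifier produces points close to the diagonal, where mean confidence and observed rate match in every bin.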
The recovery dial
Verification tells you the truth. Recovery decides what to do about it, and that is a product decision, not a default. Retrying a failed booking buys real reservations but re-exposes you to deep ghosts, so every retry trades a little silent-failure risk for a lot of booking success. This repo exposes that as a dial:
| policy | real bookings | silent failures | human escalations / run |
| --- | --- | --- | --- |
| aggressive | 92.0% | 4.1% | 3.3 |
| conservative | 61.3% | 2.6% | 42.7 |
Aggressive: retry hard, maximize real bookings, accept the silent-failure rise. Suited to a $20 dinner reservation.
Conservative: never gamble a retry, hold silent failures at the verification floor, escalate the uncertain ones to a human. Suited to a $500 concert.
The insight the data forced: the only real lever on the silent-failure rate is how hard you retry. So that is the knob, set by the stakes of the action.
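One way to picture the dial is as a policy object whose only parameter is how many attempts it allows. This is a minimal sketch under that assumption: the class, the preset values, and the action strings are hypothetical, and the repo's policies may weigh confidence and provider reliability as well. The cap of three attempts is the one figure taken from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    name: str
    max_attempts: int  # total booking attempts allowed, including the first


# Hypothetical presets. Aggressive retries up to the three-attempt cap;
# conservative never retries and hands anything unverified to a human.
AGGRESSIVE = RecoveryPolicy("aggressive", max_attempts=3)
CONSERVATIVE = RecoveryPolicy("conservative", max_attempts=1)


def next_action(policy: RecoveryPolicy, verified: bool, attempts_made: int) -> str:
    """Decide what to do after an attempt has been verified."""
    if verified:
        return "done"
    if attempts_made < policy.max_attempts:
        return "retry"
    return "escalate"
```

With these presets, `next_action(CONSERVATIVE, False, 1)` escalates immediately, while `next_action(AGGRESSIVE, False, 1)` retries, which is the trade between escalation volume and retry exposure that the table shows.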
How it works
A dual-layer design that mirrors a real execution platform: a core engine plus a thin protocol surface on top.
A thin FastMCP server (Streamable HTTP, ready to run as a custom Claude connector) exposes a two-phase, consent-gated booking tool: propose_booking returns a pending action, confirm_booking executes it. The write tool carries destructiveHint so the client shows its own approval prompt before anything happens.
A simulated provider injects five outcome modes (clean, plus four failure modes: confirmed-but-not-real, partial, transient, drift) through fallible read channels, so verification has to work against noise, not a clean oracle.
The verification layer reconciles each claimed outcome against independent re-reads (run concurrently) and returns a verified status, a calibrated confidence, and the supporting evidence.
The recovery orchestrator applies the stakes policy: retries with backoff (capped at three attempts), provider reliability scoring, and human escalation.
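The concurrent re-read step described above can be sketched with `asyncio.gather`. Everything named here is an assumption for illustration: the channel functions, the status strings, and the `agreement` field, which is a raw vote fraction rather than the calibrated confidence the real layer returns.

```python
import asyncio


async def sees_booking(booking_id):      # hypothetical read channel
    return True


async def misses_booking(booking_id):    # hypothetical read channel
    return False


async def times_out(booking_id):         # hypothetical failing channel
    raise TimeoutError("read channel timed out")


async def verify_booking(booking_id, channels):
    """Re-read the booking through every channel concurrently and reconcile."""
    results = await asyncio.gather(
        *(channel(booking_id) for channel in channels),
        return_exceptions=True,
    )
    # A channel that raised is kept as evidence but casts no vote.
    votes = [r for r in results if isinstance(r, bool)]
    if not votes:
        return {"status": "unknown", "agreement": 0.0, "evidence": results}

    agreement = sum(votes) / len(votes)
    if agreement == 1.0:
        status = "verified"
    elif agreement == 0.0:
        status = "not_booked"
    else:
        status = "uncertain"
    return {"status": status, "agreement": agreement, "evidence": results}
```

Calling `asyncio.run(verify_booking("b1", [sees_booking, misses_booking]))` yields an `uncertain` status. A sketch like this also shows why the deep-ghost slice survives: when every channel wrongly reports the booking, unanimous agreement still reads as verified.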
Reproducibility
Everything runs offline against the simulated provider. No real provider, no API keys, ever. Clone it, run the eval, get the same numbers. The headline rates are pinned by tests that fail if the reported table and the underlying contingency math ever drift apart, so the numbers cannot quietly rot.
Rigor
Paired significance: McNemar's exact test on the per-task silent-failure indicator, not a comparison of two unpaired percentages.
Wilson confidence intervals on every rate.
Calibration: confidence is measured against ground truth, not claimed.
Fallible verification on purpose: a perfect "after" column would be the signal that the harness is mocked.
Adversarial self-review: the layer and its metrics were run through a fleet of reviewers, each hunting a specific failure mode. Findings fixed, zero measurement errors at the end.
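For reference, the Wilson score interval mentioned in the list above can be computed in a few lines. This is a generic sketch of the standard formula, not the repo's code; the function name and the default of z = 1.96 (roughly a 95% interval) are assumptions.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    Returns (lower, upper). Unlike the naive normal interval, the bounds
    stay inside [0, 1] even when the observed rate is near 0 or 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

This matters for rates like the 2.6% residual, where a small count of events makes the naive interval unreliable.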
What a founding engineer would own here in the first 30 to 60 days
Replace the simulated provider with real adapters (Resy, OpenTable, SevenRooms), where the verification channels become real re-reads, webhooks, and confirmation parsing.
Turn the reliability scores into a live provider-health system that routes around flaky providers per action type.
Make the dial a per-action policy surface that the platform's customers configure by stakes.
Build the system of record: persist every outcome with its evidence so the calling agent can reason about what to do next.
What is next, and deliberately not built yet
OAuth user-consent flow: the step that turns this from a paste-the-URL custom connector into a directory-grade one. The server is built to the connector spec; the consent flow is the next addition.
Observability surface: traces and a live dashboard.
Both were scoped out on purpose, to keep the core honest and tight rather than broad and shallow.
Run it
# install
pip install -e .
# run the tests
pytest
# run the full eval (200 tasks x 10 seeds, offline)
python -m evals.run_eval
# run the MCP server
python -m ophelia_verifier.server
About
Built as a work sample for Ophelia's founding AI/LLM engineer role. The goal was to build the exact kind of thing Ophelia builds, an execution layer an AI agent calls, and to take seriously the part that is actually hard: knowing whether the real-world action happened.