# ctrltest-mcp - Flight-control regression lab for MCP agents

**TL;DR:** Evaluate PID and bio-inspired controllers against analytic or diffSPH/Foam-Agent data through MCP, logging overshoot, energy, and gust metrics automatically.

It also provides a REST API interface for the control system evaluation service, allowing HTTP-based access to controller benchmarking and scoring functionality, and enables visualization of controller performance metrics and analytics data exported from PID evaluations and comparative controller assessments.
## Table of contents

- [What it provides](#what-it-provides)
- [Quickstart](#quickstart)
- [Run as a service](#run-as-a-service)
- [ToolHive smoke test](#toolhive-smoke-test)
- [Agent playbook](#agent-playbook)
- [Stretch ideas](#stretch-ideas)
- [Accessibility & upkeep](#accessibility--upkeep)
- [Metric schema at a glance](#metric-schema-at-a-glance)
- [Contributing](#contributing)
- [License](#license)
## What it provides
| Scenario | Value |
| --- | --- |
| Analytic PID benchmarking | Run closed-form plant models and produce overshoot/settling/energy metrics without manual scripting. |
| High-fidelity scoring | Ingest logged data from Foam-Agent or diffSPH runs and fuse it into controller evaluations. |
| MCP integration | Expose the scoring API via STDIO/HTTP so ToolHive or other clients can automate gain tuning and generate continuous performance scorecards. |
## Quickstart
Run a PID evaluation:
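The exact invocation depends on the transport you use (see the service and python-sdk sections below). As a rough, self-contained illustration of what the analytic path computes, here is a minimal sketch; the plant model, function name, and gains are assumptions for illustration, not the shipped API:

```python
# Minimal sketch of an analytic PID evaluation (hypothetical helper, not the
# server's actual API): simulate a PID loop around a first-order plant and
# report overshoot, integral-squared error, and settling time.
import numpy as np

def evaluate_pid_analytic(kp, ki, kd, setpoint=1.0, duration_s=3.0,
                          dt=0.001, tolerance=0.02):
    n = int(duration_s / dt)
    t = np.arange(n) * dt
    y, integral, prev_err = 0.0, 0.0, setpoint
    ys = np.empty(n)
    for i in range(n):
        err = setpoint - y
        integral += err * dt
        derivative = (err - prev_err) / dt
        u = kp * err + ki * integral + kd * derivative   # PID control law
        y += dt * (-y + u)                               # first-order plant, tau = 1 s
        ys[i] = y
        prev_err = err
    err = setpoint - ys
    outside = np.abs(err) > tolerance * abs(setpoint)
    if outside[-1]:
        settling = duration_s            # clipped: loop never stays in band
    elif outside.any():
        settling = float(t[np.nonzero(outside)[0][-1] + 1])
    else:
        settling = 0.0
    return {
        "overshoot": float(max(ys.max() - setpoint, 0.0)),   # radians
        "ise": float(np.sum(err ** 2) * dt),                 # rad^2*s
        "settling_time": settling,                           # seconds
    }

print(evaluate_pid_analytic(kp=4.0, ki=2.0, kd=0.5))
```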
Typical outputs (analytic only) are the overshoot, settling, and energy metrics described in the schema below. Note that the analytic plant example clips `settling_time` at the requested simulation duration (`duration_s=3.0`); increase the horizon if you need the loop to settle fully before computing that metric.
## Run as a service
### CLI (STDIO transport)
Use `python -m ctrltest_mcp --describe` to print basic metadata without starting the server.
### FastAPI (REST)
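A minimal sketch of calling the REST surface, assuming the service is reachable on localhost; the port, route name, and payload keys are assumptions and may differ from the actual FastAPI app:

```python
# Hypothetical REST call -- host, route, and payload fields are assumptions;
# adjust them to the routes the FastAPI app actually exposes.
import requests

resp = requests.post(
    "http://localhost:8000/evaluate",           # assumed host/port and route
    json={"kp": 4.0, "ki": 2.0, "kd": 0.5,      # PID gains under test
          "duration_s": 3.0},                   # simulation horizon
    timeout=30,
)
resp.raise_for_status()
print(resp.json())                              # metrics payload
```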
### python-sdk tool (STDIO / MCP)
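A sketch of driving the server over STDIO with the official MCP python-sdk; the tool name `evaluate_pid` and its argument keys are assumptions, so list the tools first to see what the server actually exposes:

```python
# Hypothetical MCP client session over STDIO using the official python-sdk.
# The tool name and argument keys are assumptions; check list_tools() output.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="python", args=["-m", "ctrltest_mcp"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])       # confirm available tools
            result = await session.call_tool(
                "evaluate_pid",                        # assumed tool name
                {"kp": 4.0, "ki": 2.0, "kd": 0.5, "duration_s": 3.0},
            )
            print(result.content)

asyncio.run(main())
```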
## ToolHive smoke test
Run the integration script from your workspace root:
The smoke test runs the analytic path by default. To exercise high-fidelity scoring, stage Foam-Agent archives under `logs/foam_agent/` and diffSPH gradients under `logs/diffsph/` before launching the script.
## Agent playbook
- **Gust rejection** - feed archived diffSPH gradients (`diffsph_metrics`) and Foam-Agent archives (paths returned by those services) to quantify adaptive CPG improvements.
- **Controller comparison** - log analytics for multiple PID gains, export JSONL evidence, and visualise in Grafana (see the sketch after this list).
- **Policy evaluation** - integrate with RL or evolutionary algorithms; metrics are structured for automated scoring.
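As a sketch of the controller-comparison flow, the loop below sweeps a few PID gain sets, calls the (assumed) REST endpoint from the FastAPI example, and appends JSONL evidence that Grafana can ingest; the gain values, route, and file name are illustrative only:

```python
# Sketch of a controller-comparison sweep that logs JSONL evidence.
# The REST route and payload keys are assumptions (see the FastAPI example).
import json
import requests

gain_grid = [
    {"kp": 2.0, "ki": 1.0, "kd": 0.2},
    {"kp": 4.0, "ki": 2.0, "kd": 0.5},
    {"kp": 6.0, "ki": 2.5, "kd": 0.8},
]

with open("controller_comparison.jsonl", "w") as fh:
    for gains in gain_grid:
        resp = requests.post("http://localhost:8000/evaluate",   # assumed route
                             json={**gains, "duration_s": 3.0}, timeout=30)
        resp.raise_for_status()
        record = {"gains": gains, "metrics": resp.json()}
        fh.write(json.dumps(record) + "\n")                      # one line per run
```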
## Stretch ideas
- Extend the adapter for PteraControls (planned once upstream Python bindings are published).
- Drive the MCP from `scripts/fitness` to populate nightly scorecards.
- Combine with `migration-mcp` to explore route-specific disturbance budgets.
## Accessibility & upkeep
- Hero badges include alt text and stay under five to maintain scannability.
- Run `uv run pytest` (tests mock diffSPH/Foam-Agent inputs and assert deterministic analytic results); a sketch of such a test follows this list.
- Keep metric schema changes documented; downstream dashboards rely on them.
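A deterministic analytic test might look like the following; the import path and function signature are assumptions about this repo's layout rather than its actual test suite:

```python
# Hypothetical test -- module path and function name are assumptions. The
# point is that analytic evaluations are closed-form, so repeated runs must
# return identical metrics and settling_time stays within the horizon.
def test_analytic_metrics_are_deterministic():
    from ctrltest_mcp.analytic import evaluate_pid   # assumed import path
    first = evaluate_pid(kp=4.0, ki=2.0, kd=0.5, duration_s=3.0)
    second = evaluate_pid(kp=4.0, ki=2.0, kd=0.5, duration_s=3.0)
    assert first == second                           # no hidden randomness
    assert first["settling_time"] <= 3.0             # clipped at duration_s
```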
## Metric schema at a glance
| Field | Units | Notes |
| --- | --- | --- |
| | radians | peak response minus setpoint |
| | rad²·s | integral squared error |
| | seconds | first time error stays within tolerance |
| | milliseconds | detector latency |
| | hertz | detector bandwidth |
| | 0–1 | fraction of disturbance rejected |
| | joules | energy pre-adaptation |
| | joules | energy post-adaptation |
| | 0–1 | energy reduction ratio |
| | unitless | stability margin |
| | unitless | cost weight × switches |
| | milliseconds | latency budget after switching |
| | joules | mix-of-experts energy draw |
| | unitless | only when both diffSPH & Foam metrics are present |
| | varies | raw diffSPH/Foam metrics merged in |
Example of fused high-fidelity metrics:
## Contributing
- `uv pip install --system -e .[dev]`
- `uv run ruff check .` and `uv run pytest`
- Share sample metrics in PRs so reviewers can sanity-check improvements quickly.
## License

MIT license - see LICENSE.