# find_silent_failures
Detect silent failures in scheduled jobs: exit code 0 but empty output, length anomalies, error keywords, or duration anomalies. Surfaces hidden issues in cron, systemd timers, and OpenClaw schedulers.
## Instructions
Returns jobs that exited with code 0 but whose output was flagged by one or more silent-fail rules: empty output, output-length anomaly, error keywords in stdout, or duration anomaly.
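To make the core idea concrete, here is a minimal, hypothetical sketch (the `Run` dataclass and `is_silent` helper are illustrative names, not the server's actual API): a run only counts as *silent* when it exits 0 yet its output looks wrong. This sketch implements just the simplest rule, empty output.

```python
from dataclasses import dataclass

@dataclass
class Run:
    exit_code: int
    output: str

def is_silent(run: Run) -> bool:
    """Exit code 0 but empty/whitespace-only output: the job 'succeeded' silently."""
    return run.exit_code == 0 and not run.output.strip()

print(is_silent(Run(exit_code=0, output="")))        # True: silent failure
print(is_silent(Run(exit_code=1, output="")))        # False: explicit failure, not silent
print(is_silent(Run(exit_code=0, output="done\n")))  # False: healthy run
```

A non-zero exit code is never flagged here: that is an explicit failure, which ordinary monitoring already catches.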
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| window_hours | No | Lookback window in hours | 24 |
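The only parameter, `window_hours`, translates directly into a timestamp cutoff: runs that started before it are ignored. A small sketch (the `window_cutoff` helper is hypothetical, shown only to illustrate the arithmetic):

```python
from datetime import datetime, timedelta, timezone

def window_cutoff(window_hours: int, now: datetime) -> datetime:
    """Runs that started before this timestamp fall outside the lookback window."""
    return now - timedelta(hours=window_hours)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
cutoff = window_cutoff(24, now)
print(cutoff.isoformat())  # 2024-01-01T12:00:00+00:00

# A run that started 30 hours ago is excluded from the default 24-hour window.
run_started = now - timedelta(hours=30)
print(run_started >= cutoff)  # False
```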
## Implementation Reference
- `src/silentwatch_mcp/server.py:166-173` (handler): Tool handler for `find_silent_failures` — reads the `window_hours` argument, fetches all jobs and their runs from the backend, then delegates to `build_silent_failure_report()` for aggregation.
  ```python
  if name == "find_silent_failures":
      window_hours = int(arguments.get("window_hours", 24))
      jobs = await backend.list_jobs()
      runs_by_job: dict[str, list[Any]] = {}
      for j in jobs:
          runs_by_job[j.id] = await backend.get_job_runs(j.id, limit=200)
      report = build_silent_failure_report(jobs, runs_by_job, window_hours=window_hours)
      return _serialize(report)
  ```

- `src/silentwatch_mcp/server.py:102-119` (registration): Tool registration — declares the `find_silent_failures` tool with its description and input schema (accepts an optional `window_hours` integer).
  ```python
  Tool(
      name="find_silent_failures",
      description=(
          "Jobs that returned exit code 0 but output was flagged by silent-fail "
          "rules (empty output, length anomaly, error keywords, duration anomaly)."
      ),
      inputSchema={
          "type": "object",
          "properties": {
              "window_hours": {
                  "type": "integer",
                  "description": "Lookback window in hours (default 24)",
                  "default": 24,
              }
          },
          "required": [],
      },
  ),
  ```

- Core aggregation logic — `build_silent_failure_report()` filters runs within the window, identifies silent failures via `is_silent_failure()`, deduplicates indicators, and produces a `SilentFailureReport`.
  ```python
  def build_silent_failure_report(
      jobs: list[CronJob],
      runs_by_job: dict[str, list[CronRun]],
      window_hours: int = 24,
  ) -> SilentFailureReport:
      """Aggregate silent failures across all jobs within a window."""
      cutoff = datetime.now(UTC) - timedelta(hours=window_hours)
      flagged: list[SilentFailureFlag] = []
      for job in jobs:
          runs = runs_by_job.get(job.id, [])
          in_window = [r for r in runs if r.started_at >= cutoff]
          silent = [r for r in in_window if is_silent_failure(r)]
          if not silent:
              continue
          # Deduplicate indicators across silent runs
          indicators_seen: set[SilentFailIndicator] = set()
          for r in silent:
              indicators_seen.update(r.silent_fail_indicators)
          flagged.append(
              SilentFailureFlag(
                  job_id=job.id,
                  job_name=job.name,
                  silent_fail_count=len(silent),
                  total_runs=len(in_window),
                  silent_fail_rate=len(silent) / len(in_window) if in_window else 0.0,
                  indicators=sorted(indicators_seen, key=lambda i: i.value),
                  sample_run_ids=[r.run_id for r in silent[:5]],
              )
          )
      return SilentFailureReport(
          window_hours=window_hours,
          jobs_flagged=flagged,
          total_jobs_checked=len(jobs),
      )
  ```

- Helper — `detect_silent_fail_indicators()` checks a single run (exit_code == 0) for empty output, error keywords in stdout, output length anomaly, and duration anomaly against historical runs.
  ```python
  def detect_silent_fail_indicators(
      run: CronRun,
      historical_runs: list[CronRun] | None = None,
  ) -> list[SilentFailIndicator]:
      """Return all silent-fail indicators that fire for this run.

      A run is a silent failure candidate only if exit_code == 0; if
      exit_code != 0 it's an explicit failure, not silent.
      """
      if run.exit_code != 0:
          return []
      indicators: list[SilentFailIndicator] = []
      output = run.output_snippet or ""
      # Rule: output empty
      if not output.strip():
          indicators.append(SilentFailIndicator.OUTPUT_EMPTY)
      # Rule: error keywords in stdout despite exit 0
      if DEFAULT_ERROR_KEYWORDS.search(output):
          indicators.append(SilentFailIndicator.ERROR_KEYWORDS_IN_STDOUT)
      # Rule: output length anomaly (vs historical median)
      if historical_runs:
          successful_lengths = [
              len(r.output_snippet)
              for r in historical_runs
              if r.status == RunStatus.SUCCESS and r.output_snippet
          ]
          if len(successful_lengths) >= 5:
              baseline = median(successful_lengths)
              if baseline > 0 and len(output) < baseline * 0.3:
                  indicators.append(SilentFailIndicator.OUTPUT_LENGTH_ANOMALY)
      # Rule: duration anomaly (vs historical median)
      if historical_runs and run.duration_ms is not None:
          successful_durations = [
              r.duration_ms
              for r in historical_runs
              if r.status == RunStatus.SUCCESS and r.duration_ms is not None
          ]
          if len(successful_durations) >= 5:
              baseline_ms = median(successful_durations)
              if run.duration_ms < baseline_ms * 0.1:
                  indicators.append(SilentFailIndicator.DURATION_ANOMALY_SHORT)
      return indicators
  ```

- `src/silentwatch_mcp/types.py:129-134` (schema): Response schema for `find_silent_failures` — `SilentFailureReport` contains `window_hours`, a `jobs_flagged` list (`SilentFailureFlag`), and `total_jobs_checked`.
  ```python
  class SilentFailureReport(BaseModel):
      """Response for `find_silent_failures`."""

      window_hours: int
      jobs_flagged: list[SilentFailureFlag]
      total_jobs_checked: int
  ```
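The two anomaly rules in `detect_silent_fail_indicators()` both compare the current run against the median of at least five successful historical runs. A worked example with toy numbers (the data is invented purely for illustration) shows how the 30% length threshold and 10% duration threshold fire:

```python
from statistics import median

# Toy history: output lengths (characters) from six successful runs.
historical_lengths = [900, 1000, 1100, 950, 1050, 1000]
baseline = median(historical_lengths)
print(baseline)  # 1000.0

# Length-anomaly rule: flag if current output is under 30% of the median.
current_len = 120
print(current_len < baseline * 0.3)  # True -> OUTPUT_LENGTH_ANOMALY fires

# Duration-anomaly rule: flag if the run took under 10% of the median duration.
historical_ms = [4000, 5000, 6000, 4500, 5500, 5000]
current_ms = 300
print(current_ms < median(historical_ms) * 0.1)  # True -> DURATION_ANOMALY_SHORT fires
```

Requiring at least five historical samples keeps a single unusual early run from defining the baseline; the median (rather than the mean) makes the baseline robust to the occasional outlier run.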