Skip to main content
Glama

srefix-diagnosis

GitHub stars License: Apache 2.0 Agents Python 3.10+ Works with Claude Code

Diagnose any tech stack with Claude. 256 specialized SRE agents for Postgres, Kafka, Kubernetes, Istio, … — local, MCP-based, Apache 2.0 licensed.

Claude diagnosing an Nginx 502 incident via srefix-diagnosis MCPs

🌐 Live · srefix.com  ·  Quick Start  ·  🏗️ Architecture


30-second demo: Claude (headless) diagnoses an Nginx 502 incident by orchestrating 5 MCPs — diag-nginx (manual lookup), srefix-explorer (fallback plan), srefix-mock-telemetry (canned Prom/Loki/jumphost). Reasoning is real; only the telemetry I/O is mocked. Reproduce with python3 demo/run_benchmark.py.

Content provenance: The diagnostic content under agents/*.md was SYNTHESIZED BY LARGE LANGUAGE MODELS (Claude family by Anthropic and/or GPT family by OpenAI), then validated against publicly-available open-source documentation. No proprietary, internal, or non-public material was used. See NOTICE for the list of public sources whose documentation informed synthesis, and ISSUES.md to report any copyright concerns.

Verify coverage caveat: only 3 of 250 manuals (nginx, prometheus, vitess) currently ship with metric-name whitelists, against which a manual's PromQL is verified. The other 247 manuals are LLM-synthesized and unverified — Claude may reference metric names that don't exist in your environment. Run srefix-verify-corpus before trusting the manuals on a tech that matters. Adding a whitelist takes ~10 min per tech; PRs welcome.

What's in this repo

Directory

What it is

agents/

250 markdown diagnosis manuals

mcp/

One Python package with 250 entry-point launchers — each is its own MCP server (srefix-diag-postgres / srefix-diag-hbase / ...)

discovery-mcp/

Auto-discovery MCP (33 adapters across 5 layers — cloud API / service registry / cluster bootstrap / k8s / VM tag). Coverage of any specific tech depends on its deployment shape; see "Coverage caveats" below.

prometheus-mcp/

PromQL query MCP

loki-mcp/

LogQL query MCP

es-mcp/

Elasticsearch / OpenSearch query MCP

jumphost-mcp/

SSH-via-jumphost executor with safety gating

explorer-mcp/

Tier-2/3 fallback exploration when manuals miss + cross-tech dependency fan-out

verify-mcp/

Metric-name verifier — diffs manuals against real-exporter whitelists. Run this first. See "Verify accuracy" below.

At a glance

256 MCP servers, ~1799 tools, 33 discovery adapters across 5 layers. Whether a given tech is actually auto-discoverable depends on how you deployed it — managed cloud service vs registered in Consul/ZK vs k8s workload vs tagged VM vs reachable bootstrap host. Techs that don't fit any path (local CLI tools / abstract concepts) are surfaced by a placeholder virtual adapter — listed, not "discovered." See "Coverage caveats" before relying on this.

5-scenario benchmark (real Claude, mocked telemetry)

Real Claude reasoning over 5 SRE incident scenarios. The mock-telemetry MCP serves canned data per scenario so the demo is reproducible without a production environment. Reasoning is genuine — only the I/O is mocked.

Scenario

Difficulty

Pass

Duration

Keywords matched

Nginx 502 Bad Gateway — Upstream Timeout

Basic

57.2s

4/4 (100%)

MongoDB Replica Set Election Storm

Advanced

44.2s

4/6 (67%)

etcd Disk I/O Latency Degrading Kubernetes

Advanced

36.0s

5/5 (100%)

CoreDNS Failure — Cluster-wide DNS Outage

Advanced

90.9s

5/5 (100%)

Cassandra GC Pause Storm

Intermediate

100.0s

4/5 (80%)

Pass rate: 5/5 (100%) · avg 65.7s — reproduce with python3 demo/run_benchmark.py; per-scenario detail in demo/benchmark_report.json. Pass criterion: keyword overlap with the scenario's expected diagnosis ≥ min_confidence (0.5–0.8 per scenario, see scenarios.json). Keyword sets include reasonable domain synonyms (e.g., rollback and secondary alongside stepdown/replica for the MongoDB scenario) so a correct diagnosis using equivalent vocabulary still scores. Numbers fluctuate ±10s run-to-run from non-determinism in Claude's tool-use loop.

Layer

MCP servers

Tools

Knowledge — 250 diag-{tech}

250

1750 (7 each)

Telemetry — srefix-prom / srefix-loki / srefix-es

3

22

Execution — srefix-jumphost

1

5–6

Discovery — srefix-discovery (33 adapters)

1

6

Reasoning fallback — srefix-explorer

1

8

Total

256

~1799

Architecture

flowchart TB
    LLM([LLM / Claude])

    subgraph KNOW["Knowledge — 250 diag-tech MCPs"]
        DIAG["diag-postgres / diag-hbase / diag-kafka / ... / diag-cloudflare"]
    end

    subgraph META["Reasoning Layer"]
        EX["srefix-explorer<br/>fallback plan + dependency graph"]
    end

    subgraph DISCOVERY["srefix-discovery — 33 adapters across 5 layers (coverage varies by deployment)"]
        D_BASIC["opscloud4 / kubernetes / zookeeper"]
        D_CLOUD["aws / gcp / azure / digitalocean"]
        D_CHINA["aliyun / tencentcloud / huaweicloud / jdcloud / volcengine"]
        D_PAAS["vercel / flyio / railway / heroku"]
        D_REG["consul / eureka / nacos / backstage"]
        D_DIRECT["redis-cluster / mongo / cassandra / es-direct / etcd"]
        D_SAAS["SaaS 14: cloudflare / datadog / sentry / snowflake / planetscale / auth0 / okta / netlify / pagerduty / opsgenie / github-actions / gitlab-ci / newrelic / splunk"]
        D_RUN["nomad / rancher / helm / zabbix / nagios / openfaas / knative"]
        D_VIRT["virtual: meta-agents + local-tools + abstract concepts"]
    end

    subgraph EXEC["Telemetry + Execution"]
        PROM["srefix-prom<br/>PromQL"]
        LOKI["srefix-loki<br/>LogQL"]
        ES["srefix-es<br/>ES/OpenSearch"]
        JH["srefix-jumphost<br/>SSH via bastion"]
    end

    LLM --> DIAG
    LLM --> EX
    LLM --> DISCOVERY
    LLM --> EXEC
    EX -.suggested calls.-> EXEC
    DIAG -.extract_diagnostic_queries.-> EXEC

Discovery layers + env-var matrix

flowchart LR
    subgraph BASIC["Basic 3"]
        OPSCLOUD4["OPSCLOUD4_BASE_URL +<br/>OPSCLOUD4_TOKEN"]
        K8S["K8S_DISCOVERY_ENABLED=1"]
        ZK["ZK_QUORUMS"]
    end

    subgraph CLOUD_W["Western cloud"]
        AWS["AWS_DISCOVERY_REGIONS"]
        GCP["GCP_PROJECTS"]
        AZ["AZURE_SUBSCRIPTION_IDS"]
        DO["DIGITALOCEAN_TOKEN"]
    end

    subgraph CLOUD_CN["China cloud"]
        ALI["ALIBABA_CLOUD_ACCESS_KEY_ID"]
        TC["TENCENTCLOUD_SECRET_ID"]
        HW["HUAWEICLOUD_ACCESS_KEY"]
        JD["JDCLOUD_ACCESS_KEY"]
        VOLC["VOLCENGINE_ACCESS_KEY"]
    end

    subgraph PAAS["PaaS"]
        VC["VERCEL_TOKEN"]
        FLY["FLY_API_TOKEN"]
        RW["RAILWAY_TOKEN"]
        HK["HEROKU_API_TOKEN"]
    end

    subgraph REG["Service registries"]
        CON["CONSUL_URL"]
        EUR["EUREKA_URL"]
        NAC["NACOS_URL"]
        BS["BACKSTAGE_URL"]
    end

    subgraph DIRECT["Self-as-registry"]
        RC["REDIS_CLUSTERS"]
        MGO["MONGODB_CLUSTERS"]
        CAS["CASSANDRA_CLUSTERS"]
        ESD["ES_DISCOVERY_ENDPOINTS"]
        ETD["ETCD_CLUSTERS"]
    end

    subgraph SAAS["SaaS — 14 adapters"]
        CF["CLOUDFLARE_API_TOKEN"]
        DD["DD_API_KEY + DD_APP_KEY"]
        SE["SENTRY_AUTH_TOKEN + SENTRY_ORG"]
        SF["SNOWFLAKE_ACCOUNT/USER/PASSWORD"]
        PS["PLANETSCALE_TOKEN + ORG"]
        AU["AUTH0_DOMAIN + AUTH0_MGMT_TOKEN"]
        OK["OKTA_DOMAIN + OKTA_API_TOKEN"]
        NL["NETLIFY_TOKEN"]
        PD["PAGERDUTY_TOKEN"]
        OG["OPSGENIE_API_KEY"]
        GH["GITHUB_TOKEN + GITHUB_ORG"]
        GL["GITLAB_TOKEN"]
        NR["NEW_RELIC_API_KEY"]
        SP["SPLUNK_URL + SPLUNK_TOKEN"]
    end

    subgraph RUN["Orchestrator + monitoring servers"]
        NM["NOMAD_ADDR"]
        RN["RANCHER_URL + RANCHER_TOKEN"]
        HE["HELM_DISCOVERY_ENABLED=1"]
        ZB["ZABBIX_URL + ZABBIX_API_TOKEN"]
        NG["NAGIOS_URL"]
        OF["OPENFAAS_URL"]
        KN["KNATIVE_DISCOVERY_ENABLED=1"]
    end

    subgraph VIRT["Always-on virtual"]
        VRT["(no env required — covers 32 meta + tools + abstract)"]
    end

Each env block is independently opt-in. Setting nothing leaves only the virtual adapter active (which still surfaces the 32 meta/tools/abstract techs).

Coverage caveats (read this before trusting "covered")

Discovery is seeded, not scanned. Whether a given tech in agents/ is actually surfaced by srefix-discovery depends on the deployment shape:

If the tech is…

…it's discoverable via

A managed cloud service (RDS / ElastiCache / MSK / Aliyun RDS / Tencent Redis / etc.)

Layer 1 — cloud API

Self-deployed on tagged VMs (Service=hbase)

Layer 2 — VM tag classification

Registered in Consul / Nacos / Eureka / ZooKeeper

Layer 3 — service registry

Reachable from a bootstrap host (Redis / Mongo / Cassandra / ES / etcd)

Layer 4 — cluster bootstrap

Running as a k8s StatefulSet/Deployment

Layer 5 — kubeconfig walk

Things to know:

  • Big-data stack (Spark / Hadoop / Hive / Trino) maps to Layer 1 (only if you're on EMR / DataProc / HuaweiCloud MRS) or Layer 2 (tagged EC2 / GCE) or Layer 5 (k8s). Self-deployed Hive on bare metal that doesn't register anywhere → not auto-discoverable without contributing a custom seed.

  • Local CLI tools, abstract concepts, and meta-agents (~32 techs: git / docker-compose / linux-perf / etc.) are surfaced by the virtual adapter, which does no actual discovery — it returns a placeholder cluster so the diag-{tech} MCP can still be invoked. Don't read these as "discovered."

  • ZooKeeper adapter ships dispatchers for HBase / Kafka / Solr / HDFS HA (the last surfaces the active + standby NameNodes by parsing the ActiveNodeInfo protobuf in /hadoop-ha/<nameservice>/ActiveStandbyElectorLock; DataNodes are not in ZK — query the NN admin API to enumerate those).

  • Network adapters that can't be tested without a real cluster (CN cloud SDKs, JD/Volcengine especially) are present but uncovered by the test suite — version drift in their SDKs may break them silently.

  • The vtgate_queries_error-class problem (LLM hallucination in static content) doesn't apply here — discovery code is hand-written and uses real client libraries — but feature-completeness claims should be treated as "implemented" not "battle-tested in production."

Self-deployed tech on raw cloud VMs (tag-based classification)

Cloud APIs only see "an EC2 instance" — they don't know the box is running HBase. To bridge this, every cloud adapter accepts a tag-based classifier that groups VMs into typed clusters matching agents/<tech>.md. Convention:

Service=hbase                  ← matches diag-hbase.md (tech short-name)
ClusterName=hbase-prod-us-east ← members with same value form one cluster
Role=master | regionserver     ← optional; surfaced as Host.role

Supported across AWS / GCP / Azure / DigitalOcean / Aliyun / Tencent Cloud / Huawei Cloud / JD Cloud / Volcengine — each with the right SDK-shape tag normalizer. The set of valid Service values is auto-loaded from agents/*.md, so all 250 techs are recognized; adding a new manual extends the matcher with no code change.

Example for AWS:

from srefix_discovery_mcp.adapters.aws_extended import build_aws_ec2_classified

instances = ec2.describe_instances()['Reservations'][...]['Instances']  # raw boto3
clusters = build_aws_ec2_classified(instances, region='us-east-1', account='123')
# → 1 hbase cluster (3 nodes, roles preserved) + 1 kafka cluster + N untagged

Per-cloud entry points:

Cloud

Function

Tag shape

AWS

build_aws_ec2_classified

[{Key, Value}, ...]

GCP

build_gce_instances_classified

{key: value} (lowercase)

Azure

build_azure_vms_classified

{key: value}

DigitalOcean

build_do_droplets_classified

["service:hbase", ...]

Aliyun

build_aliyun_ecs_classified

{Tag: [{TagKey, TagValue}]}

Tencent Cloud

build_tc_cvm_classified

[{Key, Value}, ...]

Huawei Cloud

build_hw_ecs_classified

[{key, value}, ...]

JD Cloud

build_jd_vm_classified

[{Key, Value}, ...]

Volcengine

build_volc_ecs_classified

[{Key, Value}, ...]

Untagged instances fall through to per-cloud defaults (ec2 / gce / azure-vm / droplet / cvm / vm) so nothing is dropped silently.

For self-deployed HBase / Kafka on raw VMs, the fastest path is still ZooKeeper (ZK_QUORUMS=…) — it works without any tagging. Tag-based classification is the right answer when you want to discover arbitrary self-deployed tech (Cassandra, ClickHouse, Redis, MongoDB, custom apps) that doesn't necessarily register in ZK.

Requirements

  • Python ≥ 3.10 (the mcp package requires it)

  • macOS / Linux

  • For discovery-mcp[kubernetes]: a kubeconfig

  • For discovery-mcp[zookeeper]: network access to your ZK quorums

  • For jumphost-mcp: working ~/.ssh/config with ProxyJump entries + ssh-agent loaded

The 250 manuals under agents/ were LLM-synthesized. We audited 60 of them manually (Phase 2 + 3, ~819 fixes — see PHASE2_AUDIT.md and PHASE3_AUDIT.md); the remaining 190 are unverified and may still contain hallucinated metric names, deprecated CLI flags, or version-drifted config keys. Before you install and trust this corpus, run the metric verifier.

verify-mcp ships per-tech whitelists captured from real exporter /metrics output and diffs them against every metric reference in the manuals. Anything in a manual that doesn't exist in the real exporter is flagged as a likely hallucination.

# 1. Install the verifier (small — no telemetry deps)
cd srefix-diagnosis/verify-mcp
pip install -e .

# 2. Run the audit on the whole agents/ corpus
srefix-verify-corpus srefix-diagnosis/agents

# Or just one tech
srefix-verify-corpus --tech vitess srefix-diagnosis/agents

Sample output:

Manuals scanned: 250
  with whitelist:      3  (nginx, prometheus, vitess)
  without whitelist: 247  (not yet covered)

In covered manuals: 58 metric references
  matched whitelist (likely real): 16
  flagged (likely hallucinated):   42

── Flagged metrics by manual ──
  [vitess]  19 flagged
    × vtgate_queries_error  ×6  L155,242,278,426,429,…
    × vtgate_queries_processed_total  ×4  L167,1252,1271,1326
    ...

vtgate_queries_error is a representative case: the LLM invented this name and used it ~21 times in the Vitess manual; the real metric is vtgate_api_error_counts. The verifier surfaces every such pattern with file/line references so you (or a script) can patch them en masse.

Adding a whitelist for a new tech

The current ship covers only 3 of 250 techs — community contributions are how we close the gap. To add a whitelist:

  1. Boot the tech's exporter (locally or in CI), capture /metrics:

    curl http://<exporter>:<port>/metrics \
      | grep '^# HELP' | awk '{print $3}' | sort -u > metric_names.txt
  2. Save as verify-mcp/verify_mcp/whitelists/<tech>.json with the format used by vitess.json (include source, captured_at, method).

  3. Re-run srefix-verify-corpus --tech <tech> to see what gets flagged.

  4. PR welcome.

For techs without a Prometheus exporter (CLI-only, Java JMX-only, etc.), a CLI-flag verifier is on the roadmap — see ISSUES.md.

Fixing what the verifier finds — propose / apply split

Once the verifier surfaces flagged metrics, the question is "what should they be replaced with?" The srefix-fix tool keeps the LLM that proposes a fix strictly separated from the deterministic tool that applies it — so an LLM mistake can never silently reach agents/.

Stage

Tool

Trust

Propose

srefix-fix propose <tech> — spawns claude --print with read-only --allowedTools (Read / Bash(grep) / WebFetch). Cross-checks the manual against the real exporter source and emits a YAML draft.

LOW (LLM may hallucinate)

Review

Human edits the YAML — drop unsure entries, correct new if proposer got it wrong, add confirmed_by: <name> to entries they trust

gate

Apply

srefix-fix apply <yaml> — pure regex sed, word-boundary safe, no LLM at runtime. Same input, same output, every CI run.

HIGH

# 1. Auto-draft (read-only Claude run against the Vitess manual + source)
srefix-fix propose vitess
# → fix_maps/vitess.draft.yaml

# 2. Review (drop unsure entries, add confirmed_by)
$EDITOR fix_maps/vitess.draft.yaml
mv fix_maps/vitess.draft.yaml fix_maps/vitess.yaml
git add fix_maps/vitess.yaml

# 3. Dry-run, then real apply
srefix-fix apply fix_maps/vitess.yaml --dry-run
srefix-fix apply fix_maps/vitess.yaml

# 4. Inspect + commit
git diff agents/vitess-agent.md
git commit -am "fix(vitess): apply confirmed metric-name corrections"

See fix_maps/README.md for the full schema and fix_maps/_example.yaml for a worked Vitess example (the vtgate_queries_error → vtgate_api_error_counts headline fix).

Why this design works for batch audits:

  1. claude --print headless lets the propose stage run unattended overnight across all 247 uncovered techs.

  2. --allowedTools whitelist keeps Claude read-only — no possibility of it editing a manual directly.

  3. YAML in git makes "LLM suggestion" and "human decision" separately reviewable and auditable.

  4. The applier is pure sed. Run it 1000 times in CI; same diff every time.

  5. --dry-run + git diff makes every actual change to agents/ reviewable.

If you remember one thing: proposer and applier MUST be separated, with a human-reviewed YAML between them. That separation is the only thing guaranteeing "an LLM mistake doesn't pollute source-of-truth content."

Use as an MCP tool

The verifier is also exposed as an MCP server (srefix-verify) so Claude can self-check a manual before trusting it during diagnosis:

{
  "mcpServers": {
    "srefix-verify": { "command": "srefix-verify" }
  }
}

Tools: verify_manual, audit_corpus, list_whitelisted_techs, whitelist_info.

Quick Start

Requirements: Python 3.10+, the claude CLI (Claude Code) on PATH. If your system Python is 3.9 or older, create a fresh env first:

conda create -n srefix python=3.11 -y && conda activate srefix
# or: python3.11 -m venv .venv && source .venv/bin/activate

1. The 250 diagnosis MCPs (mcp/)

By default, pip install -e . would install all 250 commands. You almost never want this — too heavy for Claude. Pick a subset first.

cd srefix-diagnosis/mcp

# See what's available
python3 generate.py --list-all
python3 generate.py --list-all | grep -i sql

# Pick one of three filter modes:

# (a) explicit list
python3 generate.py --techs postgres redis kafka hbase k8s

# (b) from a file (recommended — you can git-commit the file)
cat > my_stack.txt <<'EOF'
postgres
redis
kafka
hbase
k8s
prometheus
nginx
istio
EOF
python3 generate.py --from-file my_stack.txt

# (c) regex
python3 generate.py --regex 'postgres|mysql|redis|kafka'

# Then install — only the selected commands appear on PATH
pip install -e .

Verify:

which srefix-diag-postgres   # should print a path
srefix-diag-postgres --help  # MCP servers don't have --help; if it starts and waits on stdio, it works

To add or remove a tech later: edit my_stack.txt, re-run python3 generate.py --from-file my_stack.txt, re-run pip install -e ..

To install all 250 anyway (not recommended in Claude config, OK on PATH):

python3 generate.py     # no filter = all 250
pip install -e .

2. Discovery MCP (cluster auto-discovery)

cd srefix-diagnosis/discovery-mcp
pip install -e ".[all]"   # includes kubernetes + kazoo extras
# Or pick one:
#   pip install -e ".[kubernetes]"
#   pip install -e ".[zookeeper]"
#   pip install -e .             # opscloud4-only

Adapters auto-register based on env vars:

# opscloud4 (any one of: REST CMDB)
export OPSCLOUD4_BASE_URL=https://opscloud.your-corp.com
export OPSCLOUD4_TOKEN=<x-token>

# kubernetes (uses ambient kubeconfig)
export K8S_DISCOVERY_ENABLED=1
export K8S_CONTEXTS=prod-east,prod-west       # optional

# zookeeper (HBase / Kafka legacy / Solr)
export ZK_QUORUMS="zk-prod-east=zk1:2181,zk2:2181,zk3:2181;zk-prod-west=zkw1:2181"
export ZK_WATCHES=hbase,kafka,solr            # optional, default = all

export DISCOVERY_CACHE_TTL=300                # optional, default 300s

Run: srefix-discovery

3. Prometheus MCP

cd srefix-diagnosis/prometheus-mcp
pip install -e .

export PROMETHEUS_URL=http://prometheus.prod:9090
# Optional auth:
#   PROMETHEUS_TOKEN=<bearer>
#   PROMETHEUS_USERNAME=... PROMETHEUS_PASSWORD=...
#   PROMETHEUS_VERIFY_TLS=0   (skip TLS verify)
#   PROMETHEUS_TIMEOUT=30

Run: srefix-prom

4. Loki MCP

cd srefix-diagnosis/loki-mcp
pip install -e .

export LOKI_URL=http://loki.prod:3100
# Optional:
#   LOKI_TOKEN=<bearer>           or LOKI_USERNAME / LOKI_PASSWORD
#   LOKI_ORG_ID=<tenant>          (multi-tenant Loki)
#   LOKI_TIMEOUT=30
#   LOKI_VERIFY_TLS=0

Run: srefix-loki

5. Elasticsearch / OpenSearch MCP

cd srefix-diagnosis/es-mcp
pip install -e .

export ES_URL=https://es.prod:9200
# Auth (one of):
#   ES_API_KEY=<id:secret base64>
#   ES_USERNAME=... ES_PASSWORD=...
# Optional:
#   ES_VERIFY_TLS=0
#   ES_TIMEOUT=30

Run: srefix-es

6. Jumphost MCP (SSH via bastion)

cd srefix-diagnosis/jumphost-mcp
pip install -e .

Set up two YAML config files:

mkdir -p ~/.config/srefix

# Inventory — which hosts exist + their tags
cat > ~/.config/srefix/inventory.yaml <<'EOF'
hosts:
  pg-prod-1:
    tags: {env: prod, tech: postgres, role: primary}
  pg-prod-2:
    tags: {env: prod, tech: postgres, role: replica}
  hbase-rs-001:
    tags: {env: prod, tech: hbase, role: regionserver}
EOF

# Presets — read-only commands the LLM is allowed to invoke
cat > ~/.config/srefix/presets.yaml <<'EOF'
postgres:
  pg-replication-status:
    description: Replication status from primary
    command: 'psql -At -c "SELECT pid, state, sent_lsn FROM pg_stat_replication"'
    allowed_roles: [primary]
    timeout: 10
  pg-table-size:
    description: Size of one table
    command: 'psql -At -c "SELECT pg_total_relation_size(''{table_name}'')"'
    allowed_roles: [primary, replica]
    allowed_args: [table_name]
    timeout: 10
EOF

Make sure ~/.ssh/config has ProxyJump set up for each host:

Host bastion
  HostName bastion.your-corp.com
  User you

Host pg-prod-*
  ProxyJump bastion
  User postgres

Env:

export JUMPHOST_INVENTORY=~/.config/srefix/inventory.yaml
export JUMPHOST_PRESETS=~/.config/srefix/presets.yaml
export JUMPHOST_MODE=preset_only       # default; safest
# Other modes:
#   JUMPHOST_MODE=filtered_arbitrary    # `run` enabled, denylist applied
#   JUMPHOST_MODE=unrestricted          # `run` enabled, no filter (use with external approval)
export JUMPHOST_DRY_RUN=1              # never actually exec; great for testing
export JUMPHOST_DEFAULT_TIMEOUT=30

Run: srefix-jumphost

7. Explorer MCP (Tier-2/3 fallback)

cd srefix-diagnosis/explorer-mcp
pip install -e .

No env required. 8 tools:

  • fallback_exploration_plan(symptom, tech?, cluster_id?, host_pattern?) — Tier-2 structured plan

  • free_explore_bootstrap(symptom, tech?, cluster_id?) — Tier-3 schema-aware starter

  • expand_to_dependencies(tech, depth?, observation?) — upstream fan-out via dep graph

  • expand_to_dependents(tech) — blast-radius reverse lookup

  • reflect_on_findings(findings, top_k?) — feed evidence back, get keywords

  • categorize_symptom / list_symptom_categories / list_supported_techs

Run: srefix-explorer

Register with Claude

Add to your Claude MCP config:

  • Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json

  • Claude Code: ~/.config/claude-code/mcp.json or project-level .claude/mcp.json

Skeleton (pick the MCPs you actually want):

{
  "mcpServers": {
    "discovery": {
      "command": "srefix-discovery",
      "env": {
        "OPSCLOUD4_BASE_URL": "https://opscloud.your-corp.com",
        "OPSCLOUD4_TOKEN":    "...",
        "K8S_DISCOVERY_ENABLED": "1",
        "ZK_QUORUMS": "zk-prod-east=zk1:2181,zk2:2181,zk3:2181"
      }
    },
    "prom":     { "command": "srefix-prom",  "env": { "PROMETHEUS_URL": "..." } },
    "loki":     { "command": "srefix-loki",  "env": { "LOKI_URL": "..." } },
    "es":       { "command": "srefix-es",    "env": { "ES_URL": "...", "ES_API_KEY": "..." } },
    "jumphost": {
      "command": "srefix-jumphost",
      "env": {
        "JUMPHOST_INVENTORY": "/Users/you/.config/srefix/inventory.yaml",
        "JUMPHOST_PRESETS":   "/Users/you/.config/srefix/presets.yaml",
        "JUMPHOST_MODE":      "preset_only",
        "JUMPHOST_DRY_RUN":   "1"
      }
    },
    "diag-postgres": { "command": "srefix-diag-postgres" },
    "diag-hbase":    { "command": "srefix-diag-hbase" },
    "diag-redis":    { "command": "srefix-diag-redis" },
    "diag-kafka":    { "command": "srefix-diag-kafka" },
    "diag-k8s":      { "command": "srefix-diag-k8s" }
  }
}

For the diag-* block, generated automatically — copy whichever lines you need from mcp/claude_mcp_config.json. Or filter to a subset:

cd mcp
python3 filter_config.py postgres redis kafka hbase k8s > ~/my_diag_subset.json
# then merge ~/my_diag_subset.json["mcpServers"] into your Claude config

How the diagnose-with-evidence loop works

Observation: "PostgreSQL replica 复制延迟 > 5min"
  │
  ├─ discovery.list_hosts(tech="postgres", role="primary")
  │       → ["pg-prod-1"]
  ├─ diag-postgres.diagnose("replication lag spiking")
  │       → case "ReplicationLagCritical" + Symptoms / Diagnosis / Root Cause Tree / Thresholds
  ├─ diag-postgres.extract_diagnostic_queries("ReplicationLagCritical")
  │       → [4 structured queries: 3 psql + 1 shell, each with suggested_mcp]
  │
  ├─ prom.range_query("pg_replication_lag_seconds{...}", "-30m", "now")
  ├─ jumphost.run_safe(host="pg-prod-1", tech="postgres",
  │                    preset_name="pg-replication-status")
  ├─ loki.query_range('{app="postgres"} |= "wal"', "-30m", "now")
  ├─ es.search("logs-postgres-*", query='level:ERROR AND host:"pg-prod-1"')
  │
  └─ Claude correlates evidence + manual → diagnosis + suggested fixes

Verify install

# Each command should be on PATH:
which srefix-discovery srefix-prom srefix-loki srefix-es srefix-jumphost
which srefix-diag-postgres   # plus whichever diag-* you generated

# Smoke-test one MCP server (will hang on stdin — that's normal; Ctrl-C to exit):
srefix-diag-postgres

# Full inspection via the MCP CLI inspector:
npx @modelcontextprotocol/inspector srefix-diag-postgres

Troubleshooting

Symptom

Cause / Fix

ModuleNotFoundError: No module named 'mcp'

Python < 3.10 or pip install ran with the wrong interpreter. Use python3.10 -m pip install -e .

srefix-diag-foo: command not found

foo.md exists but you didn't include foo in the generate.py filter. Re-run generate.py + pip install -e .

Discovery returns 0 clusters

Missing env vars (no adapter enabled), or auth failure — check discovery_health() output

jumphost.run_safe returns "host not in inventory"

Add the host to your inventory.yaml, or your JUMPHOST_INVENTORY env points at a different file

Claude can't see the MCP

Restart Claude after editing config; Claude only re-reads at startup

License

Apache License 2.0 — see LICENSE for the full text and the "CONTENT PROVENANCE DISCLOSURE" section explaining how the diagnostic content was synthesized.

Apache 2.0 was chosen over MIT because it adds an explicit patent grant, which protects users (and contributors) against patent claims by adopters. See NOTICE for required attribution and ISSUES.md for the copyright takedown procedure.

A
license - permissive license
-
quality - not tested
C
maintenance

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/aisrefix-commits/srefix-diagnosis'

If you have feedback or need assistance with the MCP directory API, please join our Discord server