OnCall Runbook MCP Server


spec.md•8.47 KiB

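# Feature Specification: On-Call Runbook Encyclopedia MCP Server

**Feature Branch**: `001-feature-on-call`
**Created**: 2025-10-13
**Status**: Draft
**Original Input**: "On-Call Runbook Encyclopedia MCP Server specification"

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Retrieve and receive safe initial remediation (Priority: P1)

After receiving an alert, the on-call engineer enters symptom keywords. Within 5 seconds the system returns the most relevant runbook chunks, an initial diagnosis checklist, commands labeled as safe or risky, and a document freshness notice (if the runbook was last verified more than 90 days ago).

**Why this priority**: Directly reduces MTTA/MTTI and is the core value; without it there is no MVP.

**Independent Test**: Deploying only the search / read / checklist / commands / freshness rules is enough to verify the flow end to end (symptom input → ranked results + steps).

**Acceptance Scenarios**:

1. **Given** 100 valid runbooks, **When** a symptom query with 2 to 3 keywords is entered, **Then** the Top K (default 5) chunks are returned with source path, score, initial diagnosis checklist, and lists of safe and risky commands.
2. **Given** a runbook whose last_verified_at is more than 90 days old, **When** that document appears in the results, **Then** the response marks it "possibly outdated".
3. **Given** no matching chunks, **When** a query is run, **Then** "not covered" is returned together with the escalation owner.

### User Story 2 - Rule-driven Q&A and degradation (Priority: P2)

The engineer asks a question in natural language; the system first performs retrieval and rule aggregation. If an LLM API key is available, it produces a summarized, integrated answer (with citations and risk annotations). Without a key, it returns an extractive, assembled answer and avoids speculation.

**Why this priority**: Improves efficiency and readability, but the core works without an LLM, hence P2.

**Independent Test**: With the LLM disabled, citation-based answers are still returned; with the LLM enabled, summary paragraphs are added and the format stays consistent.

**Acceptance Scenarios**:

1. **Given** an API key is configured, **When** a compound-symptom question is asked, **Then** the answer contains (a) a summary paragraph, (b) at least one chunk citation, (c) risky commands marked ⚠️, and (d) a staleness notice (if applicable).
2. **Given** no API key, **When** the same question is asked, **Then** the answer is an extractive assembly with no generated speculative statements, and it includes citations and risk annotations.
3. **Given** conflicting metadata across multiple runbooks, **When** answering, **Then** the primary source is chosen by (component > service > most recent verification time) and the secondary sources are listed as hints.

### User Story 3 - Incident handoff and postmortem skeleton (Priority: P3)

When a handoff is requested during an incident, the system generates a Handoff block (status summary, executed steps, residual risks, suggested next steps). After the incident ends, it produces a Postmortem skeleton (Timeline table, action list, SEV, Owner, Root Cause placeholder).

**Why this priority**: Improves governance and consistency but is not required for immediate remediation, hence P3.

**Independent Test**: Enabling only the handoff / postmortem tools is enough to verify the output format; no LLM is needed.

**Acceptance Scenarios**:

1. **Given** an incident summary plus a command log, **When** handoff is called, **Then** the four main blocks are produced and their fields are non-empty.
2. **Given** closed-incident information (start/end time, SEV, primary impact), **When** postmortem is called, **Then** the output contains the Timeline table header and at least 5 action item placeholder rows.

### Edge Cases

- Query consisting only of stop words → return a guidance message asking for specific keywords.
- Missing required frontmatter → mark the runbook invalid, exclude it from ranking, and list it under meta.ignore.
- Service not in the catalog → reply UNKNOWN and point to the process for adding it (citing owner_slack).
- Duplicate files (differing only in letter case) → keep the first after normalization and record the exclusion reason for the rest.
- Malformed last_verified_at → skip the freshness check and record it in meta.warnings.

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: The system MUST support multi-keyword search and return the Top K chunks (K configurable, default 5) with chunk score and source path.
- **FR-002**: The system MUST parse and output the required frontmatter fields (title, service, component, severity_default, last_verified_at, owner_slack, owner_team).
- **FR-003**: The system MUST mark runbooks whose last_verified_at is more than 90 days old (configurable) as "possibly outdated" in answers.
- **FR-004**: The system MUST provide a checklist tool that extracts initial diagnosis and mitigation steps and outputs the aggregated safe_ops.
- **FR-005**: The system MUST provide a commands tool that separates safe_ops from risk_ops and attaches ⚠️ and a rollback reminder placeholder to risk_ops.
- **FR-006**: The system MUST output a suggested SEV and notification list based on the symptom text and severity rules.
- **FR-007**: The system MUST resolve conflicts between multiple documents using the (component > service > most recent verification time) rule and mark the primary source.
- **FR-008**: The system MUST support a degraded mode without an LLM key that still provides citations and risk annotations; with a key, it adds a summary of at most two paragraphs.
- **FR-009**: The system MUST block path traversal requests.
- **FR-010**: The system MUST return the standard UNKNOWN/ESCALATE message, including the escalation owner, when nothing matches.
- **FR-011**: The system MUST record invalid runbooks and exclude them from search.
- **FR-012**: The system MUST support mixed Chinese/English keyword search from phase one (including basic normalization: letter case, full-width/half-width forms, and common service alias mapping).
- **FR-013**: The system MUST generate a handoff template containing: Status Summary, Executed Steps, Residual Risks, Next Actions.
- **FR-014**: The system MUST generate a postmortem skeleton containing: Overview, Impact, Timeline table header, Root Cause placeholder, Action Items (>=5). (The original English template and the duplicate requirement blocks that followed were removed to avoid ambiguity; the requirements listed here are the sole authoritative source.)
- **FR-015**: The system MUST attach ⚠️ to every risky command in an answer and, when no rollback information is available, prompt for manual confirmation.
- **FR-016**: The system MUST make the Top K parameter adjustable through configuration and validate its bounds (minimum 1, maximum 20).
- **FR-017**: The system MUST output warnings and a summary of the applied conflict-resolution strategy in meta.
- **FR-018**: The system MUST provide citations (source path + chunk index) for every answer.

### Key Entities

- **Runbook**: An incident-handling document; attributes: frontmatter fields, body paragraphs, command blocks, verification date.
- **Chunk**: A retrievable segment of a runbook; attributes: text, source path, chunk score, heading context.
- **CommandExtraction**: The set of safe_ops / risk_ops aggregated from chunks; attributes: classification, original text, rollback hint.
- **AnswerMeta**: Supplementary information attached to an answer; attributes: citations[], warnings[], appliedConflictRule, staleFlag.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: For at least 80% of common alerts (test sample set), the first query returns the corresponding runbook within the Top 3 chunks.
- **SC-002**: With 100 local runbooks (average 2 KB chunk), a single retrieval + answer flow has P95 latency < 500 ms (no LLM).
- **SC-003**: In no-LLM mode, citations in answers match the actual source text 100% of the time (sampled testing).
- **SC-004**: Risky-command miss rate < 5% (compared against an annotated test corpus).
- **SC-005**: Degraded mode (no key) still covers all User Story 1 acceptance scenarios (100% of the test checklist passes).
- **SC-006**: The standard UNKNOWN/ESCALATE message is returned consistently in 100% of no-match cases.
- **SC-007**: Template fields produced by the Postmortem/Handoff tools are 100% complete (fields non-empty).
- **SC-008**: Stale runbooks are flagged with 100% accuracy (controlled experiment).

## Assumptions

- Default Top K = 5, adjustable via the configuration file.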
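To make three of the rules above concrete, here is a minimal Python sketch of the freshness check (FR-003), the Top K bounds (FR-016), and primary-source selection (FR-007). It is an illustration under stated assumptions, not the project's implementation: the names `RunbookMeta`, `validate_top_k`, `is_stale`, and `pick_primary` are hypothetical, `last_verified_at` is assumed to be an ISO `YYYY-MM-DD` string, and the "component > service > most recent verification" order is interpreted as an exact component match outranking a service match, with the verification date as the tie-breaker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

STALE_AFTER_DAYS = 90          # FR-003 default; the spec says this is configurable
TOP_K_MIN, TOP_K_MAX = 1, 20   # FR-016 bounds
TOP_K_DEFAULT = 5              # FR-001 default


@dataclass
class RunbookMeta:
    """Hypothetical subset of the frontmatter fields listed in FR-002."""
    path: str
    service: str
    component: str
    last_verified_at: str  # raw string; ISO YYYY-MM-DD is an assumption


def validate_top_k(value: int = TOP_K_DEFAULT) -> int:
    """Reject Top K values outside the 1..20 range required by FR-016."""
    if not TOP_K_MIN <= value <= TOP_K_MAX:
        raise ValueError(f"top_k must be in [{TOP_K_MIN}, {TOP_K_MAX}], got {value}")
    return value


def parse_verified(raw: str) -> Optional[date]:
    """Return the parsed date, or None when the value is malformed."""
    try:
        return date.fromisoformat(raw)
    except (TypeError, ValueError):
        return None


def is_stale(meta: RunbookMeta, today: date, warnings: List[str]) -> bool:
    """Flag runbooks verified more than STALE_AFTER_DAYS ago.

    A malformed date skips the check and records a warning, mirroring the
    edge case that writes to meta.warnings.
    """
    verified = parse_verified(meta.last_verified_at)
    if verified is None:
        warnings.append(f"{meta.path}: invalid last_verified_at, freshness check skipped")
        return False
    return (today - verified).days > STALE_AFTER_DAYS


def pick_primary(
    candidates: List[RunbookMeta], service: str, component: str
) -> Tuple[RunbookMeta, List[RunbookMeta]]:
    """Order candidates by component match, then service match, then recency."""
    def priority(meta: RunbookMeta):
        verified = parse_verified(meta.last_verified_at) or date.min
        return (meta.component == component, meta.service == service, verified)

    ranked = sorted(candidates, key=priority, reverse=True)
    return ranked[0], ranked[1:]
```

With this sketch, a runbook verified exactly 90 days ago is not flagged, because the rule is "more than 90 days". `pick_primary` returns the primary source plus the remaining candidates, which could feed the secondary-source hints and the `appliedConflictRule` field of AnswerMeta.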


