# Rootly MCP Server Improvement Plan
## What We're Doing
We're improving your Rootly MCP server in six stages over 8-10 weeks. The focus is on **security, code quality, testing, CI/CD, monitoring, and MCP best practices** - with **zero breaking changes** for existing users.
**Note:** This plan has been reviewed and enhanced based on GPT-4o feedback. See `GPT4O_REVIEW.md` for detailed recommendations.
---
## The 6 Stages
### Stage 1: Fix Critical Security Issues (1.5 weeks)
**What's wrong:**
- Your API tokens are being logged (even partially), which is a security risk
- Error messages contain full stack traces that expose internal file paths and implementation details
- No validation that URLs and tokens are properly formatted before using them
- Missing HTTPS enforcement and data-in-transit protection
- No security audit or vulnerability scanning
**What we'll do:**
- Remove all token logging - just say "token configured" instead
- Strip stack traces from error responses (log them server-side only)
- Validate tokens and URLs before using them
- Add timeout to all network requests
- **NEW:** Enforce HTTPS for all API calls (reject HTTP)
- **NEW:** Add input sanitization to prevent injection attacks
- **NEW:** Implement rate limiting to prevent abuse
- **NEW:** Run security audit with tools like `bandit` and `safety`
- **NEW:** Add dependency vulnerability scanning to CI pipeline
**Files affected:** `__main__.py`, `client.py`, `server.py`, plus a new `security.py` (sketched below)
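To make the intent concrete, here is a minimal sketch of the validation helpers that could live in the new `security.py`. The function names and the token character check are illustrative assumptions, not the final implementation:

```python
# security.py - a minimal sketch of the validation and HTTPS checks described above.
# validate_api_token / validate_base_url and the token format rule are assumptions.
import re
from urllib.parse import urlparse


class SecurityError(ValueError):
    """Raised when a token or URL fails validation."""


def validate_api_token(token: str) -> str:
    """Check the token is present and plausibly formed; never log its value."""
    if not token or not token.strip():
        raise SecurityError("API token is missing")
    if not re.fullmatch(r"[A-Za-z0-9_\-\.]+", token):
        raise SecurityError("API token contains unexpected characters")
    return token


def validate_base_url(url: str) -> str:
    """Enforce HTTPS and a well-formed host before any request is made."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise SecurityError("Only HTTPS URLs are allowed")
    if not parsed.netloc:
        raise SecurityError("URL is missing a host")
    return url
```

With checks like these in place, startup logging can report only that a token is configured, never any portion of its value.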
**Why this matters:** These are high-severity security vulnerabilities that could expose sensitive data. Enhanced security measures protect against common attack vectors.
---
### Stage 2: Improve Error Handling (1 week)
**What's wrong:**
- The code catches all exceptions with generic `except Exception` (14 times!)
- You can't tell what actually went wrong (timeout? auth failure? network issue?)
- Some errors are silently ignored
**What we'll do:**
- Replace broad exception catching with specific exception types
- Create custom exception classes (`RootlyAuthenticationError`, `RootlyNetworkError`, etc.)
- Make error messages consistent and helpful
- Add a decorator to handle errors automatically (see the sketch below)
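A rough sketch of the exception hierarchy and decorator, assuming the API client is `httpx`-based (class and function names are placeholders, not existing code):

```python
# errors.py - sketch of custom exceptions plus a decorator that translates
# low-level HTTP errors into them. Assumes an httpx-based client.
import functools
import logging

import httpx

logger = logging.getLogger(__name__)


class RootlyError(Exception):
    """Base class for all errors raised by the MCP server."""


class RootlyAuthenticationError(RootlyError):
    """401/403 responses from the Rootly API."""


class RootlyNetworkError(RootlyError):
    """Timeouts and connection failures."""


def handle_rootly_errors(func):
    """Catch specific httpx errors and re-raise them as the custom types above."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except httpx.TimeoutException as exc:
            logger.warning("Request timed out: %s", exc)
            raise RootlyNetworkError("The Rootly API did not respond in time") from exc
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code in (401, 403):
                raise RootlyAuthenticationError("Authentication with Rootly failed") from exc
            raise RootlyError(f"Rootly API returned {exc.response.status_code}") from exc
    return wrapper
```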
**Why this matters:** Better error handling means easier debugging and better error messages for users.
---
### Stage 3: Simplify Complex Code (2 weeks)
**What's wrong:**
- Some functions are massive: `get_oncall_shift_metrics()` is 313 lines!
- Same code patterns repeated 3-4 times throughout
- Hard to understand, test, and maintain
**What we'll do:**
- Break large functions into smaller, focused functions (max 50 lines each)
- Extract duplicate code into reusable helpers
- Replace magic numbers with named constants
- Split code into logical modules (pagination, metrics, formatters)
**Example:**
- `get_oncall_shift_metrics()` (313 lines) → 6 smaller functions (~50 lines each)
- `search_incidents()` (103 lines) → 3 smaller functions (~30 lines each)
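As one concrete illustration, the pagination loop that currently repeats across the large functions could be pulled into a single helper. This is a sketch assuming the API returns JSON:API-style `data` arrays; `fetch_page` is a hypothetical callback, not existing code:

```python
# A sketch of extracting the repeated pagination loop into one reusable helper.
from typing import Any, Callable, Iterator

DEFAULT_PAGE_SIZE = 100  # named constant instead of a magic number


def iter_paginated(
    fetch_page: Callable[[int, int], dict[str, Any]],
    page_size: int = DEFAULT_PAGE_SIZE,
) -> Iterator[dict[str, Any]]:
    """Yield items from a paginated endpoint, one page at a time."""
    page = 1
    while True:
        body = fetch_page(page, page_size)
        data = body.get("data", [])
        yield from data
        if len(data) < page_size:  # a short page means we reached the end
            return
        page += 1
```

Each large tool function would then shrink to assembling filters, calling the helper, and formatting the result.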
**Why this matters:** Simpler code is easier to understand, test, and modify. Less likely to have bugs.
---
### Stage 4: Add Comprehensive Tests & CI/CD (2 weeks)
**What's wrong:**
- Overall test coverage is only ~50%
- Critical files have 0% test coverage:
- `client.py` (handles all API requests) - 0 tests
- `__main__.py` (entry point) - 0 tests
- Data processing functions - 0 tests
- Custom tools only 30% tested
- No CI/CD pipeline for automated testing
- No clear testing strategy or test organization
**What we'll do:**
- Add 200+ new tests organized by type:
- **Unit tests:** Test individual functions in isolation (~150 tests)
- **Integration tests:** Test API client + server interactions (~30 tests)
- **End-to-end tests:** Test complete user workflows (~20 tests)
- **Security tests:** Test authentication, authorization, input validation
- Get overall coverage to >80% (>90% for critical paths)
- Add performance benchmarks and load tests
- **NEW:** Set up CI/CD pipeline (GitHub Actions):
- Run all tests on every PR
- Automated linting (black, ruff, mypy)
- Security scanning (bandit, safety)
- Coverage reporting
- Automated deployment on merge to main
- **NEW:** Add test fixtures and factories for consistent test data (see the sketch after this list)
- **NEW:** Implement TDD approach for new features going forward
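The unit tests for `client.py` would follow a pattern like the sketch below: a fixture supplies consistent test data and a fake transport replaces the network, so tests stay fast and deterministic. `FakeTransport`, the payload shape, and the endpoint path are assumptions for illustration:

```python
# tests/test_client.py - a sketch of the planned unit-test style (pytest).
import pytest


class FakeTransport:
    """Stands in for the HTTP layer so the test needs no network access."""

    def __init__(self, payload):
        self.payload = payload

    def get(self, path, params=None):
        return self.payload


@pytest.fixture
def incident_payload():
    """Consistent test data, per the fixtures/factories bullet above."""
    return {"data": [{"id": "1", "attributes": {"title": "Database outage"}}]}


def test_incident_titles_are_extracted(incident_payload):
    transport = FakeTransport(incident_payload)
    body = transport.get("/v1/incidents")
    titles = [item["attributes"]["title"] for item in body["data"]]
    assert titles == ["Database outage"]
```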
**Why this matters:** Tests catch bugs before they reach production, make refactoring safer, and CI/CD ensures consistent quality on every change.
---
### Stage 5: MCP Protocol Best Practices (1 week)
**What's wrong:**
- Tools don't declare their output format (no schemas)
- Can't cancel long-running operations
- No prompt definitions (server is tool-only)
- Missing pagination metadata
**What we'll do:**
- Add typed output schemas to all tools (see the sketch below)
- Implement request cancellation
- Add 5 useful prompts (incident triage, on-call handoff, postmortem, etc.)
- Add pagination metadata to responses
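For the output schemas, one option is to declare Pydantic models for each tool's return value so MCP clients know the shape in advance. This is a sketch; the field names and the exact way a schema is attached to a tool are assumptions:

```python
# A sketch of a typed output schema for one tool (field names are illustrative).
from typing import List, Optional

from pydantic import BaseModel, Field


class IncidentSummary(BaseModel):
    id: str
    title: str
    status: str
    severity: Optional[str] = None


class SearchIncidentsResult(BaseModel):
    """Declared output of a search tool, including the pagination metadata
    mentioned above, so clients can parse responses without guessing."""

    incidents: List[IncidentSummary]
    total_count: int = Field(ge=0)
    next_page: Optional[int] = None
```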
**Why this matters:** Better MCP compliance means better integration with Claude and other MCP clients.
---
### Stage 6: Monitoring, Observability & Documentation (2 weeks)
**What we'll do:**
- Run automated code formatters (black, ruff, isort)
- Add comprehensive documentation:
- Architecture overview
- Security guidelines
- Development guide
- Complete API reference
- Deployment and operations guide
- Add performance benchmarks
- Add detailed inline documentation
- **NEW:** Implement structured logging (see the sketch after this list):
- Use structured logging format (JSON)
- Add correlation IDs for request tracing
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Separate security audit logs
- **NEW:** Add observability and monitoring:
- Health check endpoint
- Metrics collection (request count, latency, errors)
- Performance monitoring
- Error tracking and alerting
- Request/response logging (sanitized)
- **NEW:** Developer training materials:
- Onboarding guide for new contributors
- Code review checklist
- Testing best practices guide
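A minimal sketch of the structured-logging piece, using a JSON formatter and a `contextvars`-based correlation ID. The field names, logger name, and module layout are assumptions, not the final design:

```python
# logging_setup.py - sketch of JSON-structured logs with a per-request correlation ID.
import json
import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per log line so log aggregators can parse it.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def new_request_id() -> str:
    """Called at the start of each request so all of its log lines share one ID."""
    rid = uuid.uuid4().hex
    correlation_id.set(rid)
    return rid


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("rootly_mcp").addHandler(handler)
```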
**Why this matters:** Good documentation makes the codebase accessible to other developers. Monitoring and observability help quickly identify and resolve production issues.
---
## Key Results
After all 6 stages:
**Security:**
- ✅ Zero high/critical vulnerabilities
- ✅ No sensitive data in logs or errors
- ✅ All inputs validated
- ✅ HTTPS enforced for all connections
- ✅ Rate limiting implemented
- ✅ Security scanning in CI pipeline
**Code Quality:**
- ✅ All functions under 50 lines
- ✅ Code complexity reduced by >50%
- ✅ No duplicate code
**Testing:**
- ✅ >80% test coverage overall
- ✅ >90% coverage for critical security/business logic
- ✅ Tests run in under 5 minutes
- ✅ Unit, integration, and E2E tests
- ✅ Security and performance tests included
**CI/CD:**
- ✅ Automated testing on every PR
- ✅ Automated linting and security scanning
- ✅ Coverage reporting
- ✅ Automated deployment
**Monitoring & Observability:**
- ✅ Structured logging with correlation IDs
- ✅ Health check endpoint
- ✅ Performance metrics collection
- ✅ Error tracking and alerting
**MCP Compliance:**
- ✅ All tools have output schemas
- ✅ Request cancellation supported
- ✅ Prompt definitions available
**No Breaking Changes:**
- ✅ Existing API unchanged
- ✅ Backward compatible
- ✅ No migration required
---
## Tradeoffs & Decisions
### 1. Code Clarity vs Performance
**Decision:** Prioritize code clarity
**Why:** The server is I/O bound (waiting on API calls), not CPU bound. Clear code is more valuable than micro-optimizations that would save <5% of runtime.
### 2. Comprehensive Fixes vs Quick Wins
**Decision:** Do it right, do it once
**Why:** Quick fixes create technical debt. Taking 8-10 weeks now prevents years of maintenance headaches.
### 3. Test Coverage Goals
**Decision:** 80% overall, 90% for critical paths
**Why:** 100% coverage has diminishing returns. Focus on high-risk areas gives best ROI.
### 4. Breaking Changes
**Decision:** Zero breaking changes
**Why:** Existing users shouldn't need to change their code. All improvements are backward compatible.
### 5. Code Organization
**Decision:** Split into multiple focused files
**Why:** Better testability and maintainability. Slight increase in imports is worth it.
---
## What Could Go Wrong
### Risk 1: Refactoring Breaks Things
**How we'll prevent it:**
- Add comprehensive tests BEFORE refactoring
- Change one function at a time
- Keep old code until new code is verified
- Use feature flags for new implementations
### Risk 2: Error Handling Changes Behavior
**How we'll prevent it:**
- Document all changes
- Test all error scenarios before and after
- Maintain backward compatible error response formats
### Risk 3: MCP Changes Break Clients
**How we'll prevent it:**
- Make all protocol enhancements backward compatible
- Version schemas and prompts
- Test with multiple client versions
### Risk 4: Team Resource Changes
**How we'll prevent it:**
- Document all decisions and rationale
- Pair programming for knowledge transfer
- Comprehensive documentation throughout
- Regular code reviews to share knowledge
- Backup plans for critical roles
### Risk 5: Unforeseen Technical Debt
**How we'll prevent it:**
- Regular code audits during refactoring
- Track and prioritize technical debt items
- Add buffer time to estimates
- Use feature flags to isolate risky changes
- Incremental rollout strategy
---
## Timeline
```
Week 1-1.5:  ██████        Stage 1: Critical Security Fixes
Week 2-3:    ████          Stage 2: Error Handling
Week 4-5:    ████████      Stage 3: Code Simplification
Week 6-7:    ████████      Stage 4: Tests & CI/CD
Week 8:      ████          Stage 5: MCP Compliance
Week 9-10:   ████████      Stage 6: Monitoring & Documentation

Total: 8-10 weeks
```
**Deployment Strategy:**
- Week 3: Deploy security + error handling together (Stages 1-2)
- Week 5: Deploy refactored code (Stage 3)
- Week 7: Deploy testing infrastructure + CI/CD (Stage 4)
- Week 10: Deploy final improvements (Stages 5-6)
**Monitoring Between Deployments:**
- After each deployment: monitor for 3-5 days
- Track error rates, performance metrics, user feedback
- Rollback plan ready for each deployment
---
## Bottom Line
**Time:** 8-10 weeks (more realistic than initial 6-8 week estimate)
**Effort:** 1-2 full-time developers
**Risk:** Medium (but mitigated with careful staging, CI/CD, and monitoring)
**Breaking Changes:** None
**Value:** Secure, maintainable, well-tested MCP server with CI/CD, monitoring, and best practices
**Cost/Benefit:**
- Initial investment: 8-10 weeks
- Long-term savings: Fewer bugs, easier maintenance, faster feature development
- Security: Protected against common vulnerabilities
- Reliability: Automated testing and monitoring catch issues early
The plan is comprehensive but practical. Each stage delivers value independently, so you could stop after any stage if needed.
---
## Changes From Original Plan
Based on GPT-4o review, we added:
1. **Enhanced security** (HTTPS enforcement, rate limiting, security audits)
2. **CI/CD pipeline** (automated testing, deployment)
3. **Monitoring & observability** (structured logging, metrics, alerting)
4. **Detailed testing strategy** (unit, integration, E2E, security tests)
5. **Extended timeline** (8-10 weeks instead of 6-8 for realism)
6. **Additional risks** (team resources, technical debt)
See `GPT4O_REVIEW.md` for complete feedback and recommendations.