# Claude Code Instructions
This file contains instructions for Claude Code when working with the MCP Spark Documentation Server codebase.
## Project Overview
This is an MCP (Model Context Protocol) server that provides search and retrieval tools for Apache Spark documentation. It uses:
- FastMCP for the MCP server framework
- SQLite FTS5 for full-text search with BM25 ranking
- Python-frontmatter for parsing markdown files with YAML frontmatter
- Sparse Git checkout for efficient cloning of the docs directory
## Code Principles
### British English
- Use British English in all code, comments, and documentation
- Examples: "initialise" not "initialize", "colour" not "color"
### Type Safety
- All functions must have type hints
- Use modern union syntax: `str | None` instead of `Optional[str]`
- Run mypy in strict mode before committing
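For example, a small helper in the expected style (the function itself is hypothetical):
```python
def normalise_path(path: str, prefix: str | None = None) -> str:
    """Strip an optional prefix and any trailing slashes from a document path."""
    if prefix is not None and path.startswith(prefix):
        path = path[len(prefix):]
    return path.rstrip("/")
```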
### Error Handling
- Use explicit error handling
- Raise appropriate exceptions with descriptive messages
- Never silently fail
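A minimal sketch of the expected pattern (the helper is hypothetical):
```python
from pathlib import Path


def load_markdown(path: Path) -> str:
    """Read a markdown file, raising a descriptive error rather than failing silently."""
    if not path.is_file():
        raise FileNotFoundError(f"Documentation file not found: {path}")
    return path.read_text(encoding="utf-8")
```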
### Code Style
- Follow PEP 8 guidelines
- Maximum line length: 120 characters
- Use ruff for linting and formatting
- See CODESTYLE.md for detailed style guidelines
## Development Workflow
### Making Changes
1. **Before starting**:
- Ensure development environment is initialised: `make init`
- Understand the existing code structure
2. **During development**:
- Write tests for new functionality
- Update type hints as needed
- Add docstrings following Google style (see the docstring sketch after this list)
3. **Before committing**:
- Format code: `make format`
- Run linter: `make lint`
- Run type checker: `make typecheck`
- Run tests: `make test`
- Or run all checks: `make build`
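A Google-style docstring looks like this (the signature is illustrative, not the project's actual API):
```python
def search(query: str, section: str | None = None, limit: int = 10) -> list[str]:
    """Search the documentation index.

    Args:
        query: Full-text search query.
        section: Optional section name used to filter results.
        limit: Maximum number of results to return.

    Returns:
        Matching document paths, best match first.

    Raises:
        ValueError: If ``query`` is empty.
    """
    if not query.strip():
        raise ValueError("Search query must not be empty")
    return []
```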
### Testing
- Write tests using pytest
- Place tests in `tests/` directory
- Mirror source structure in test files
- Test both success and failure cases (as in the sketch after this list)
- Aim for high coverage (>80%)
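A minimal example of the expected shape; the helper under test is hypothetical, and real tests would import from `mcp_spark_documentation` instead:
```python
import pytest


def strip_md_suffix(path: str) -> str:
    """Remove a trailing ``.md`` extension; raise if the path is empty."""
    if not path:
        raise ValueError("Path must not be empty")
    return path.removesuffix(".md")


def test_strip_md_suffix_success() -> None:
    assert strip_md_suffix("sql-programming-guide.md") == "sql-programming-guide"


def test_strip_md_suffix_rejects_empty_path() -> None:
    with pytest.raises(ValueError, match="must not be empty"):
        strip_md_suffix("")
```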
### Common Tasks
#### Adding a New Feature
1. Update the relevant model in `models.py` if needed
2. Implement the feature in the appropriate module
3. Add corresponding tests
4. Update documentation
5. Run `make build` to verify all checks pass
#### Modifying the Parser
When modifying `parser.py`:
- Consider Spark's documentation structure
- Update URL computation if needed
- Test with various markdown formats
- Update cleaning logic for Jekyll/markdown artifacts (see the parsing sketch after this list)
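A minimal sketch of the kind of parsing and cleaning involved, using python-frontmatter; the helper and the cleaning rule are illustrative, not the actual implementation in `parser.py`:
```python
import re

import frontmatter  # python-frontmatter

_LIQUID_TAG = re.compile(r"\{%.*?%\}", re.DOTALL)  # Jekyll/Liquid tags


def parse_doc(path: str) -> tuple[str, str]:
    """Return (title, cleaned body) for a Spark documentation markdown file."""
    post = frontmatter.load(path)
    title = str(post.metadata.get("title", ""))
    body = _LIQUID_TAG.sub("", post.content)
    return title, body
```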
#### Updating Dependencies
1. Modify `pyproject.toml`
2. Run `make generate` to update lock file
3. Test thoroughly
4. Document the reason for the dependency
#### Indexer Modifications
When modifying `indexer.py`:
- Consider network efficiency (sparse checkout)
- Handle Git operations safely (see the sketch after this list)
- Log progress appropriately
- Test with different branches
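One way the sparse clone might be driven from Python, sketched with `subprocess` (the function and its parameters are hypothetical; the real logic lives in `indexer.py`):
```python
import subprocess
from pathlib import Path


def sparse_clone_docs(repo_url: str, branch: str, target: Path) -> None:
    """Clone only the docs/ directory of the given repository."""
    subprocess.run(
        [
            "git", "clone", "--depth", "1", "--filter=blob:none",
            "--sparse", "--branch", branch, repo_url, str(target),
        ],
        check=True,  # raise CalledProcessError rather than failing silently
    )
    subprocess.run(["git", "sparse-checkout", "set", "docs"], cwd=target, check=True)
```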
## Project Structure
```
mcp-spark-documentation/
├── src/mcp_spark_documentation/
│   ├── __init__.py          # Package initialisation
│   ├── models.py            # Data models (Document, SearchResult, etc.)
│   ├── database.py          # SQLite FTS5 database operations
│   ├── parser.py            # Markdown file parser
│   ├── indexer.py           # Documentation indexer
│   ├── server.py            # FastMCP server implementation
│   └── cli.py               # Command-line interface
├── tests/                   # Test files
├── data/                    # SQLite database storage
├── pyproject.toml           # Project configuration
├── Makefile                 # Build automation
├── Dockerfile               # Container configuration
└── README.md                # Project documentation
```
## MCP Tools
The server exposes two MCP tools:
1. **search_documentation**: Full-text search with optional section filtering
2. **read_documentation**: Retrieve complete document content
Both tools return JSON-formatted responses.
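A minimal sketch of how such a tool might be registered, assuming the standalone `fastmcp` package's decorator API (the tool body is illustrative; the real implementations live in `server.py`):
```python
import json

from fastmcp import FastMCP

mcp = FastMCP("spark-documentation")


@mcp.tool()
def read_documentation(path: str) -> str:
    """Retrieve complete document content as a JSON-formatted string."""
    # Illustrative body only; the real tool reads from the SQLite index.
    return json.dumps({"path": path, "content": ""})
```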
## Database Schema
The database has two main tables:
- `documents`: Main document storage
- `documents_fts`: FTS5 virtual table for search
Triggers keep the FTS index synchronised with the main table.
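The general shape of such a schema, sketched as a single `sqlite3` script (column names and trigger details are illustrative, not the exact DDL in `database.py`):
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id      INTEGER PRIMARY KEY,
        path    TEXT UNIQUE NOT NULL,
        title   TEXT,
        content TEXT
    );

    -- External-content FTS5 table mirroring the documents table.
    CREATE VIRTUAL TABLE documents_fts USING fts5(
        title, content,
        content='documents', content_rowid='id', tokenize='porter'
    );

    -- One of the triggers that keep the FTS index synchronised.
    CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
        INSERT INTO documents_fts(rowid, title, content)
        VALUES (new.id, new.title, new.content);
    END;
    """
)
```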
## Common Patterns
### Database Operations
Always use the context manager pattern:
```python
with self._get_connection() as conn:
    # Perform operations
    conn.commit()
```
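A plausible shape for `_get_connection` itself, sketched with `contextlib` (an assumption; the real implementation in `database.py` may differ):
```python
import sqlite3
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def _get_connection(db_path: str) -> Iterator[sqlite3.Connection]:
    conn = sqlite3.connect(db_path)
    try:
        yield conn
    finally:
        conn.close()
```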
### Lazy Initialisation
The server uses lazy initialisation for the database:
```python
_database: DocumentDatabase | None = None


def get_database() -> DocumentDatabase:
    global _database
    if _database is None:
        _database = DocumentDatabase(db_path)
    return _database
```
### Error Messages
Provide helpful error messages with suggestions:
```python
return json.dumps({
"error": f"Document not found: {path}",
"suggestion": "Use search_documentation to find valid document paths.",
})
```
## Debugging
### Index Issues
If the index isn't working correctly:
1. Check index statistics: `uv run spark-docs-index stats`
2. Rebuild the index: `uv run spark-docs-index index --rebuild`
3. Check database file permissions
4. Verify Git clone succeeded
### Search Not Finding Results
1. Verify stemming is working (Porter stemmer)
2. Check BM25 scoring weights
3. Examine the FTS5 query syntax
4. Test with simpler queries (see the query sketch after this list)
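For the last two points, it can help to run a query against the FTS table directly (the database filename here is an assumption; adjust it to wherever the index lives under `data/`):
```python
import sqlite3

conn = sqlite3.connect("data/documents.db")  # assumed filename
rows = conn.execute(
    "SELECT rowid, bm25(documents_fts) AS score "
    "FROM documents_fts WHERE documents_fts MATCH ? ORDER BY score LIMIT 5",
    ("partition",),
).fetchall()
print(rows)  # lower bm25() scores indicate better matches
```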
### Docker Build Failures
1. Ensure Git is available in the container
2. Check network connectivity during build
3. Verify sparse checkout configuration
4. Check uv installation
## Documentation URLs
Spark documentation URL pattern:
```
https://spark.apache.org/docs/latest/{path}.html
```
The parser removes `.md` extensions and adds `.html` when computing URLs.
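A hypothetical helper mirroring that rule (the real logic is in `parser.py`):
```python
def compute_url(relative_path: str) -> str:
    """Map a docs-relative markdown path to its published URL."""
    return f"https://spark.apache.org/docs/latest/{relative_path.removesuffix('.md')}.html"
```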
## Git Operations
The indexer uses sparse checkout to clone only the `docs/` directory:
```bash
git clone --depth 1 --filter=blob:none --sparse --branch master ...
git sparse-checkout set docs
```
This significantly reduces clone time and disk usage.
## Performance Considerations
- **SQLite FTS5**: Provides fast full-text search with BM25 ranking
- **Sparse checkout**: Reduces clone size from ~1GB to ~10MB
- **Lazy loading**: Database initialised only when needed
- **Connection pooling**: Not needed for SQLite (file-based)
## Maintenance
### Updating to New Spark Versions
1. Update the branch in indexer if needed
2. Rebuild the index: `uv run spark-docs-index index --rebuild --branch branch-X.Y`
3. Test search functionality
4. Update README with version info
### Monitoring Index Health
Regularly check:
- Document count: `uv run spark-docs-index stats`
- Search result quality
- Database file size
- Query performance
## Troubleshooting
### Import Errors
- Ensure virtual environment is activated
- Run `make init` to sync dependencies
- Check Python version (requires 3.12+)
### Type Check Failures
- Review mypy output carefully
- Update type hints as needed
- Check for missing return type annotations
### Test Failures
- Review test output for details
- Check test database setup
- Verify fixtures are correct
- Run with verbose output: `pytest -vvv`
## Best Practices
1. **Never** commit without running `make build`
2. **Always** write tests for new features
3. **Keep** functions focused and single-purpose
4. **Use** type hints consistently
5. **Document** non-obvious behaviour
6. **Follow** British English spelling
7. **Update** documentation when changing behaviour
8. **Test** with different Spark documentation branches