Scantool - File Scanner MCP

CONTRIBUTING.md•13.8 KiB

# Contributing to File Scanner MCP

This guide covers how to add support for a new programming language.

## Architecture Overview

The codebase uses a **unified language system** where each language has a single file in `src/scantool/languages/` that provides both:

- **Structure scanning**: Extract classes, functions, methods for `scan_file`/`scan_directory`
- **Semantic analysis**: Extract imports, entry points, definitions for `code_map` and `preview_directory`

```
src/scantool/
├── languages/            # Unified language system (one file per language)
│   ├── __init__.py       # Registry + auto-discovery
│   ├── base.py           # BaseLanguage class
│   ├── models.py         # Data models (StructureNode, ImportInfo, etc.)
│   ├── skip_patterns.py  # Directory/file skip patterns
│   ├── python.py         # PythonLanguage
│   ├── typescript.py     # TypeScriptLanguage
│   └── ...               # 20 languages total
│
├── scanner.py            # Main orchestrator (uses languages/)
├── code_map.py           # Code map analysis (uses languages/)
├── entropy/              # Saliency analysis (uses languages/ for function detection)
└── server.py             # MCP server tools
```

## Adding a New Language

Create a single file in `src/scantool/languages/` that inherits from `BaseLanguage`.
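To make the registry idea concrete, here is a minimal, self-contained sketch of how an extension-keyed registry might pick a handler when several classes claim the same extension. It is an illustration only: `SketchLanguage`, `lookup`, and the two example subclasses are hypothetical names, and the real logic in `languages/__init__.py` may work differently (for instance, by discovering modules on disk rather than using `__init_subclass__`).

```python
from typing import Optional


class SketchLanguage:
    """Hypothetical stand-in for BaseLanguage (not the project's real class)."""

    # Maps a file extension to every handler class that claims it.
    _registry: dict[str, list[type]] = {}

    def __init_subclass__(cls, **kwargs):
        # Assumption: subclasses register themselves when they are defined.
        super().__init_subclass__(**kwargs)
        for ext in cls.get_extensions():
            SketchLanguage._registry.setdefault(ext, []).append(cls)

    @classmethod
    def get_extensions(cls) -> list[str]:
        return []

    @classmethod
    def get_priority(cls) -> int:
        return 0


def lookup(extension: str) -> Optional[type]:
    """Return the highest-priority handler for an extension, or None."""
    candidates = SketchLanguage._registry.get(extension, [])
    return max(candidates, key=lambda c: c.get_priority(), default=None)


class GenericText(SketchLanguage):
    @classmethod
    def get_extensions(cls) -> list[str]:
        return [".rb", ".txt"]

    @classmethod
    def get_priority(cls) -> int:
        return 1


class Ruby(SketchLanguage):
    @classmethod
    def get_extensions(cls) -> list[str]:
        return [".rb"]

    @classmethod
    def get_priority(cls) -> int:
        return 10


# Both classes claim ".rb"; the one with the higher priority is chosen.
print(lookup(".rb").__name__)
print(lookup(".txt").__name__)
print(lookup(".unknown"))
```

In this sketch a dedicated handler outranks a generic fallback simply by returning a larger priority value, so neither class needs to know about the other.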
### Step 1: Create the Language File

```bash
# Use an existing language as template
cp src/scantool/languages/python.py src/scantool/languages/YOUR_LANGUAGE.py
```

### Step 2: Implement Required Methods

```python
from typing import Optional

from .base import BaseLanguage
from .models import StructureNode, ImportInfo, EntryPointInfo, DefinitionInfo, CallInfo


class YourLanguage(BaseLanguage):
    """Unified language handler for YourLanguage files."""

    # === Metadata (REQUIRED) ===

    @classmethod
    def get_extensions(cls) -> list[str]:
        return [".your", ".ext"]

    @classmethod
    def get_language_name(cls) -> str:
        return "YourLanguage"

    @classmethod
    def get_priority(cls) -> int:
        return 10  # Higher = preferred when multiple languages match

    # === Structure Scanning (REQUIRED) ===

    def scan(self, source_code: bytes) -> Optional[list[StructureNode]]:
        """Extract classes, functions, methods from source code."""
        # Use tree-sitter or regex to parse
        pass

    # === Semantic Analysis (REQUIRED) ===

    def extract_imports(self, file_path: str, content: str) -> list[ImportInfo]:
        """Extract import/use/require statements."""
        pass

    def find_entry_points(self, file_path: str, content: str) -> list[EntryPointInfo]:
        """Find main functions, app instances, exports."""
        pass

    # === Optional Methods ===
    # extract_definitions()        - Default reuses scan() output
    # extract_calls()              - Default returns empty list
    # classify_file()              - Default uses path-based heuristics
    # should_skip()                - Default returns False
    # should_analyze()             - Default returns True
    # is_low_value_for_inventory() - Identifies small/boilerplate files
    # resolve_import_to_file()     - Enables import graph building
    # format_entry_point()         - Custom display formatting
```

### Step 3: Test It

```bash
uv run python -c "
from scantool.languages import get_language

# Test language registration
lang = get_language('.your')
print(f'Language: {lang.get_language_name()}')

# Test scanning
code = open('tests/yourlang/samples/basic.your', 'rb').read()
structures = lang.scan(code)
for s in structures:
    print(f'  {s.type}: {s.name}')

# Test imports
content = open('tests/yourlang/samples/basic.your').read()
imports = lang.extract_imports('test.your', content)
for imp in imports:
    print(f'  Import: {imp.target_module}')
"
```

### Key Design Principles

1. **One file per language**: Combines scanner + analyzer into a single `BaseLanguage` subclass
2. **Reuse scan() output**: `extract_definitions()` defaults to converting `scan()` output, avoiding duplicate parsing
3. **Auto-discovery**: Place the file in `languages/` and it's automatically registered
4. **Tree-sitter preferred**: Use tree-sitter for AST-based parsing with regex fallback for malformed files

---

## Complete Example: Adding Ruby Support

### 1. Create the Language File

**File**: `src/scantool/languages/ruby.py`

```python
"""Ruby language support."""

from typing import Optional
import re

try:
    import tree_sitter_ruby
    from tree_sitter import Language, Parser
    TREE_SITTER_AVAILABLE = True
except ImportError:
    TREE_SITTER_AVAILABLE = False

from .base import BaseLanguage
from .models import StructureNode, ImportInfo, EntryPointInfo


class RubyLanguage(BaseLanguage):
    """Unified language handler for Ruby files."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        if TREE_SITTER_AVAILABLE:
            self.parser = Parser()
            self.parser.language = Language(tree_sitter_ruby.language())
        else:
            self.parser = None

    @classmethod
    def get_extensions(cls) -> list[str]:
        return [".rb", ".rake", ".gemspec"]

    @classmethod
    def get_language_name(cls) -> str:
        return "Ruby"

    @classmethod
    def get_priority(cls) -> int:
        return 10

    def scan(self, source_code: bytes) -> Optional[list[StructureNode]]:
        """Scan Ruby source code."""
        if not self.parser:
            return self._fallback_extract(source_code)

        try:
            tree = self.parser.parse(source_code)
            if self._has_too_many_errors(tree.root_node):
                return self._fallback_extract(source_code)
            return self._extract_structure(tree.root_node, source_code)
        except Exception as e:
            return [StructureNode(
                type="error",
                name=f"Parse error: {e}",
                start_line=1,
                end_line=1
            )]

    def _extract_structure(self, root, source_code: bytes) -> list[StructureNode]:
        """Extract structure using tree-sitter."""
        structures = []
        # ... traverse AST and build StructureNode list
        return structures

    def _fallback_extract(self, source_code: bytes) -> list[StructureNode]:
        """Regex fallback for broken files."""
        text = source_code.decode('utf-8', errors='replace')
        structures = []

        for match in re.finditer(r'^class\s+(\w+)', text, re.MULTILINE):
            line = text[:match.start()].count('\n') + 1
            structures.append(StructureNode(
                type="class",
                name=match.group(1),
                start_line=line,
                end_line=line
            ))

        # Indented defs are treated as methods
        for match in re.finditer(r'^[ \t]+def\s+(\w+)', text, re.MULTILINE):
            line = text[:match.start()].count('\n') + 1
            structures.append(StructureNode(
                type="method",
                name=match.group(1),
                start_line=line,
                end_line=line
            ))

        return structures

    def extract_imports(self, file_path: str, content: str) -> list[ImportInfo]:
        """Extract require/require_relative statements."""
        imports = []

        for match in re.finditer(r"require\s+['\"]([^'\"]+)['\"]", content):
            line = content[:match.start()].count('\n') + 1
            imports.append(ImportInfo(
                source_file=file_path,
                target_module=match.group(1),
                line=line,
                import_type="require"
            ))

        for match in re.finditer(r"require_relative\s+['\"]([^'\"]+)['\"]", content):
            line = content[:match.start()].count('\n') + 1
            imports.append(ImportInfo(
                source_file=file_path,
                target_module=match.group(1),
                line=line,
                import_type="require_relative"
            ))

        return imports

    def find_entry_points(self, file_path: str, content: str) -> list[EntryPointInfo]:
        """Find entry points in Ruby file."""
        entry_points = []

        # if __FILE__ == $0
        match = re.search(r'if\s+__FILE__\s*==\s*\$0', content)
        if match:
            line = content[:match.start()].count('\n') + 1
            entry_points.append(EntryPointInfo(
                file=file_path,
                type="if_file",
                name="$0",
                line=line
            ))

        # Rails/Sinatra app detection
        if 'Sinatra::Base' in content or 'Rails.application' in content:
            entry_points.append(EntryPointInfo(
                file=file_path,
                type="app_instance",
                name="app",
                line=1
            ))

        return entry_points
```

### 2. Add Dependencies

```toml
# Add to pyproject.toml dependencies:
"tree-sitter-ruby>=0.23.0",
```

Then run:

```bash
uv sync
```

### 3. Create Test Files

**Directory structure**: `tests/ruby/samples/basic.rb`

```ruby
require 'json'
require_relative 'helper'

class UserManager
  def initialize(database)
    @database = database
  end

  def create_user(name, email)
    @database.insert(name: name, email: email)
  end
end

def validate_email(email)
  email.include?("@")
end

if __FILE__ == $0
  puts "Running..."
end
```

### 4. Create Tests

**File**: `tests/ruby/test_ruby.py`

```python
"""Tests for Ruby language."""

from scantool.scanner import FileScanner


def test_basic_parsing(file_scanner):
    """Test basic Ruby file parsing."""
    # file_scanner is assumed to be a pytest fixture (e.g. defined in conftest.py)
    structures = file_scanner.scan_file("tests/ruby/samples/basic.rb")
    assert structures is not None
    assert any(s.type == "class" and s.name == "UserManager" for s in structures)


def test_imports():
    """Test import extraction."""
    from scantool.languages.ruby import RubyLanguage

    lang = RubyLanguage()
    content = open("tests/ruby/samples/basic.rb").read()
    imports = lang.extract_imports("basic.rb", content)
    assert len(imports) >= 2
    assert any(imp.target_module == "json" for imp in imports)


def test_entry_points():
    """Test entry point detection."""
    from scantool.languages.ruby import RubyLanguage

    lang = RubyLanguage()
    content = open("tests/ruby/samples/basic.rb").read()
    entry_points = lang.find_entry_points("basic.rb", content)
    assert len(entry_points) >= 1
    assert any(ep.type == "if_file" for ep in entry_points)
```

### 5. Run Tests

```bash
# Run language-specific tests
uv run pytest tests/ruby/ -v

# Run all tests
uv run pytest
```

---

## Two-Tier Noise Filtering

Languages integrate with a two-tier skip system:

**Tier 1**: Directory/file patterns (fast, structural)

- Handled by `skip_patterns.py`: `COMMON_SKIP_DIRS`, `COMMON_SKIP_FILES`
- Filters `.git/`, `node_modules/`, `.pyc` before the language sees them

**Tier 2**: Language-specific patterns (semantic)

- Handled by `should_analyze()` in your language class
- Filters minified JS, type declarations, generated files

### Example

```python
from pathlib import Path


def should_analyze(self, file_path: str) -> bool:
    filename = Path(file_path).name.lower()

    # Skip minified files
    if filename.endswith('.min.js'):
        return False

    # Skip generated files
    if filename.endswith('.pb.go'):
        return False

    return True
```

---

## Key Methods Reference

| Method | Purpose | Default |
|--------|---------|---------|
| `get_extensions()` | File extensions to handle | **Required** |
| `get_language_name()` | Human-readable name | **Required** |
| `scan()` | Extract structure from bytes | **Required** |
| `extract_imports()` | Find import statements | **Required** |
| `find_entry_points()` | Find main/app instances | **Required** |
| `extract_definitions()` | Get functions/classes | Reuses `scan()` |
| `extract_calls()` | Find function calls | Returns `[]` |
| `should_skip()` | Skip file before reading | Returns `False` |
| `should_analyze()` | Skip file after reading | Returns `True` |
| `classify_file()` | Categorize file | Path-based heuristics |
| `resolve_import_to_file()` | Map import to file path | Returns `None` |
| `format_entry_point()` | Display formatting | Default format |

---

## Checklist for New Languages

- [ ] Create `src/scantool/languages/LANG.py`
- [ ] Implement required methods (metadata, scan, imports, entry points)
- [ ] Add tree-sitter dependency to `pyproject.toml`
- [ ] Create test directory: `tests/LANG/samples/`
- [ ] Create test file: `tests/LANG/test_LANG.py`
- [ ] Run tests: `uv run pytest tests/LANG/`
- [ ] Run all tests: `uv run pytest`

---

## Examples to Study

| File | Features |
|------|----------|
| `python.py` | Full-featured: tree-sitter, signatures, decorators, docstrings, complexity |
| `typescript.py` | Multiple extensions (.ts, .tsx, .js), JSDoc extraction |
| `go.py` | Simple imports, method receivers, generated file skipping |
| `swift.py` | @main detection, protocol extraction, SwiftUI patterns |
| `generic.py` | Fallback for unsupported extensions |

---

## Debugging Tips

### Verify Auto-Discovery

```bash
uv run python -c "
from scantool.languages import get_registry

registry = get_registry()
print('Extensions:', list(registry.extensions()))
print('Language for .py:', registry.get('.py'))
"
```

### Inspect Tree-Sitter AST

```python
from tree_sitter import Language, Parser
import tree_sitter_YOUR_LANG

parser = Parser()
parser.language = Language(tree_sitter_YOUR_LANG.language())

with open("test.ext", "rb") as f:
    tree = parser.parse(f.read())

# str() renders the S-expression; older py-tree-sitter releases
# expose the same output as tree.root_node.sexp()
print(str(tree.root_node))
```

### Test with MCP Tools

```bash
uv run python -c "
from scantool.scanner import FileScanner

scanner = FileScanner()
result = scanner.scan_file('path/to/file.ext')
for node in result:
    print(f'{node.type}: {node.name} @{node.start_line}')
"
```

---

## Getting Help

- **Examples**: Check existing languages in `src/scantool/languages/`
- **Issues**: [GitHub Issues](https://github.com/mariusei/file-scanner-mcp/issues)
