# Source Code Splitter Architecture
## Overview
The Source Code Splitter transforms source code files into hierarchical, concatenable chunks that preserve semantic structure while enabling effective code search. The system uses tree-sitter for precise syntax tree parsing to detect semantic boundaries, creating context-aware chunks that respect language structure.
## Design Philosophy
### Tree-sitter Semantic Boundaries
The splitter uses tree-sitter parsers to identify semantic boundaries in source code, providing:
- **Precision**: Syntax tree analysis ensures accurate boundary detection
- **Language Awareness**: Native support for JavaScript, TypeScript, JSX, TSX, and Python
- **Semantic Chunking**: One chunk per function/method/class with proper hierarchy
- **Hierarchical Structure**: Chunks maintain proper nesting relationships
### Documentation-Focused Chunking
The chunking strategy prioritizes indexing public interfaces and maintaining comment-signature relationships:
- **Public Interface Emphasis**: Captures outward-facing class methods and top-level functions
- **Comment Preservation**: Documentation comments stay with their associated code
- **Hierarchical Paths**: Enable semantic search within code structure
- **Concatenable Chunks**: Chunks can be reassembled to reconstruct original context
## Architecture Components
```mermaid
graph TD
subgraph "Tree-sitter Parsing"
A[Source Code File]
B[Language Detection]
C[Tree-sitter Parser]
D[Syntax Tree Generation]
end
subgraph "Boundary Extraction"
E[Semantic Boundary Detection]
F[Hierarchical Path Construction]
G[Boundary Classification]
end
subgraph "Chunk Processing"
H[Content Extraction]
I[TextSplitter Delegation]
J[Hierarchical Chunk Creation]
end
subgraph "Output Generation"
K[ContentChunk Array]
end
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H --> I
I --> J
J --> K
style A fill:#e1f5fe
style K fill:#f3e5f5
```
## Core Components
### 1. Tree-sitter Language Parsers
The system supports multiple languages through dedicated tree-sitter parsers:
**Supported Languages:**
- **JavaScript**: ES6+ classes, functions, arrow functions, JSX elements
- **TypeScript**: Interfaces, types, enums, namespaces, decorators, generics, TSX
- **JSX/TSX**: React component parsing with TypeScript integration
- **Python**: Classes, functions, methods, decorators, and async functions
**Language Registry:**
The `LanguageParserRegistry` automatically selects the appropriate parser based on file extension and content analysis, falling back to `TextDocumentSplitter` for unsupported languages.
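A minimal sketch of how extension-based selection might look (the interface and method names here are assumptions, not the actual API):
```typescript
// Sketch only: `LanguageParser` and the registration shape are assumed names.
interface LanguageParser {
  readonly language: string;
}

class LanguageParserRegistry {
  private readonly factories = new Map<string, () => LanguageParser>();

  register(extensions: string[], factory: () => LanguageParser): void {
    for (const ext of extensions) {
      this.factories.set(ext.toLowerCase(), factory);
    }
  }

  // Returns undefined for unsupported extensions so the caller can fall
  // back to TextDocumentSplitter.
  resolve(filePath: string): LanguageParser | undefined {
    const dot = filePath.lastIndexOf(".");
    if (dot === -1) return undefined;
    return this.factories.get(filePath.slice(dot).toLowerCase())?.();
  }
}
```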
### 2. Semantic Boundary Detection
Tree-sitter parsers identify structural elements through syntax tree traversal:
**Primary Boundaries:**
- **Classes**: Complete class definitions with all members
- **Functions**: Top-level function declarations and expressions
- **Methods**: Class and interface method definitions
- **Interfaces**: TypeScript interface declarations
- **Namespaces**: TypeScript namespace blocks
- **Types**: Type alias definitions
**Boundary Extraction:**
Each boundary includes start/end positions, basic type classification, and optional name for context. The system focuses on structural boundaries rather than detailed metadata extraction.
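As a sketch, a boundary record might carry the following fields (names are illustrative; the actual shape may differ):
```typescript
// Illustrative boundary shape; the real field names may differ.
interface SemanticBoundary {
  type: "class" | "function" | "method" | "interface" | "namespace" | "type";
  name?: string;     // e.g. "UserService" or "getUser"; absent for anonymous units
  startByte: number; // inclusive start offset in the source buffer
  endByte: number;   // exclusive end offset
  startLine: number; // 1-based first line covered by the boundary
  endLine: number;
}
```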
### 3. Hierarchical Chunking Strategy
The chunking approach creates semantic units that respect code structure:
```mermaid
graph LR
subgraph "Chunk Hierarchy"
A[File Level] --> B[Class Level]
B --> C[Method Level]
A --> D[Function Level]
B --> E[Property Level]
end
subgraph "Path Structure"
F["['UserService.ts']"] --> G["['UserService.ts', 'UserService']"]
G --> H["['UserService.ts', 'UserService', 'getUser']"]
F --> I["['UserService.ts', 'calculateSum']"]
end
style A fill:#e1f5fe
style G fill:#f3e5f5
style H fill:#e8f5e8
```
**Chunking Rules (Canonical Ruleset):**
**Core Principles:**
- **Semantic Fidelity**: Boundaries align with grammar-level constructs (namespace/module, class, interface, enum, type alias, function/method/constructor).
- **Hierarchical Integrity**: Every chunk has a full path (e.g. `['File.ts', 'Namespace', 'ClassName', 'methodName']`).
- **Perfect Reconstructability**: Concatenating chunks in emission order reproduces the exact original file bytes.
- **Retrieval Granularity**: Prefer the smallest semantically meaningful unit (do not merge adjacent declarations).
**Boundary Emission** (see the annotated example after this list):
- Emit a primary chunk for each declaration that introduces a named scope or executable unit:
  - Namespaces / modules
  - Classes / interfaces / enums / type aliases
  - Top-level functions (regular / async / arrow assigned via `const`)
  - Methods / constructors (including `static` / `private` / accessor forms)
- Do NOT emit chunks for internal control flow (`if`, `for`, `switch`, etc.) or for nested local helper functions inside another function/method body (these remain part of the parent body chunk).
- Transparent wrappers (`export`, `declare`, modifiers) never suppress boundary emission; they are included in the declaration chunk content.
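The following annotated input illustrates these rules (identifiers are hypothetical; comments mark what emits and what is suppressed):
```typescript
export class UserService {                 // emits: structural chunk (`export` kept inline)
  /** Returns a normalized copy of the id. */
  normalizeId(id: string): string {        // emits: content chunk (doc comment merged in)
    const trim = (s: string) => s.trim();  // suppressed: local helper stays inside
    return trim(id).toLowerCase();         //   the normalizeId body chunk
  }
}

export function calculateSum(a: number, b: number): number { // emits: content chunk
  if (a < 0) return b;                     // suppressed: control flow never emits a chunk
  return a + b;
}
```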
**Classification (Dual Typing):**
- `boundaryType: "structural"` for: namespace/module, class, interface, enum, type alias, import/export units.
- `boundaryType: "content"` for: function, method, constructor, arrow function, variable declaration introducing executable code.
- All chunks include `types: ['code']` for backward compatibility; semantic classification augments, not replaces, existing type labels.
**Documentation & Signature Association** (example below):
- Preceding contiguous documentation comments (JSDoc / multi-line block / meaningful line comments) are merged into the declaration chunk.
- The documentation scan crosses transparent wrappers (e.g. `export` before `class`).
- Chunk `startLine` and `startByte` are adjusted to the first doc line when docs are present.
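For example, with docs present the class chunk's start is pulled back to the first comment line (a sketch of the expected metadata; values assume 1-based lines):
```typescript
// Input (lines 1-4 of a hypothetical UserService.ts):
/**
 * Service for user lookups.
 */
export class UserService {}

// Expected chunk metadata (illustrative):
// { path: ["UserService.ts", "UserService"], startLine: 1, boundaryType: "structural" }
// startLine is 1 (the JSDoc opener), not 4 (the `export` keyword).
```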
**Atomicity & Non-Merging:**
- Never merge two siblings (e.g. two methods, function + next function, class + first method).
- Never merge a structural declaration with its first child declaration.
- A method/function body is treated as a single atomic content region unless size splitting (see Size Management) is triggered.
**Size Management (Universal Max Size Enforcement)** (see the sketch after this list):
- No emitted chunk may exceed the configured maximum size (bytes or a token-estimate surrogate). This rule applies to:
  - Declaration (structural) segments (e.g. large class/interface/enum declarations with decorators or long heritage clauses)
  - Function/method/constructor bodies
  - Interstitial / global content between declarations
  - Any trailing or leading whitespace/comment regions
- Oversized segments are **delegated** to the `TextSplitter` AFTER semantic boundary determination, so structural intent is preserved.
- Sub-chunks produced from delegation:
  - Preserve strict original ordering
  - Inherit the parent path (adding a deterministic ordinal suffix only if needed for uniqueness, e.g. `MyClass/doWork#1`, `MyClass/doWork#2`)
  - Are all classified as `boundaryType: "content"` unless they correspond to the first structural declaration slice (which remains `structural` if it contains the signature/docs)
  - Never duplicate signature or doc lines
- Structural declaration chunk strategy:
  - Signature + docs remain in the first chunk (always under the max size, because the split triggers early on large bodies/content)
  - Body content beyond the first segment is delegated in slices as needed
- Guarantees:
  - Perfect reconstructability (concatenation of all chunks == original file)
  - Deterministic chunk boundaries for identical inputs/config
  - No late greedy merging stage; only size-driven subdivision
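A minimal sketch of the delegation step, assuming a `TextSplitter` exposing a `split(text)` method; for simplicity it classifies every sub-chunk as `content`, whereas the rules above keep the first signature/doc slice `structural`:
```typescript
// Sketch: subdivide an oversized segment after boundaries are fixed.
// The TextSplitter signature and chunk shape are assumptions.
interface Segment {
  path: string[];
  text: string;
  boundaryType: "structural" | "content";
}

function subdivide(
  segment: Segment,
  maxChunkSize: number,
  textSplitter: { split(text: string): string[] },
): Segment[] {
  if (new TextEncoder().encode(segment.text).length <= maxChunkSize) {
    return [segment]; // already within limits; emit unchanged
  }
  // Sub-chunks keep original order and extend the parent path with a
  // deterministic ordinal suffix (e.g. doWork#1, doWork#2).
  const leaf = segment.path[segment.path.length - 1];
  return textSplitter.split(segment.text).map((text, i): Segment => ({
    path: [...segment.path.slice(0, -1), `${leaf}#${i + 1}`],
    text,
    boundaryType: "content", // simplification: first slice may stay structural
  }));
}
```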
**Path & Hierarchy:**
- Each emitted chunk path is the full ancestry of structural declarations.
- Sub-chunks produced by size delegation extend that path deterministically (ordinal suffix or segmented identifier) without inventing new semantic ancestors.
**Wrapper & Modifier Handling:**
- `export`, `default`, `abstract`, `async`, visibility (`public|private|protected`), and decorators remain within the declaration chunk.
- Multiple stacked modifiers do not produce multiple chunks.
**Content Integrity:**
- No duplication: each byte belongs to exactly one chunk, except for an optional, intentional signature/body split (if implemented).
- No gaps: every byte of the original file is covered by exactly one chunk (structural + body subdivision collectively).
**Suppression Rules:**
- Suppress nested function-like declarations only when they are local (declared inside another function/method body) AND not part of the public structural surface (they remain embedded implementation details).
- Do not suppress functions declared directly inside namespaces or classes (those emit boundaries).
**Fallback & Error Resilience:**
- On parser failure / zero boundaries: fall back to line-based splitter preserving reconstructability.
- On partial parse (ERROR nodes): still emit any confidently parsed boundaries; malformed regions become part of surrounding content.
**Determinism:**
- Given identical source input and configuration, chunk boundaries (names, byte ranges, hierarchy) are deterministic.
**Extensibility:**
- New structural kinds (e.g. `trait`, `record`, future TS constructs) must specify:
  - Classification (structural vs content)
  - Inclusion in documentation merging rules
  - Whether they introduce a hierarchical path segment.
**Implementation Notes:**
- Boundary traversal is single-pass with documentation accumulation.
- Size threshold should be configurable (env / constructor option).
- Delegated sub-chunks MUST NOT exceed the max size individually.
- Token estimation (if used) should degrade gracefully to byte length.
**Testing Guidelines** (a sketch follows this list):
- Assert reconstructability (join == original).
- Assert presence + uniqueness of expected paths.
- Assert doc block capture (start line alignment).
- Assert large body subdivision ordering & naming.
- Assert no emission for nested local helpers.
- Assert classification correctness (`structural` vs `content`).
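A sketch of the first two assertions (runner syntax shown is vitest-style; `splitSource` is a hypothetical helper wrapping the splitter):
```typescript
import { describe, expect, it } from "vitest";

// Hypothetical helper wrapping TreesitterSourceCodeSplitter.
declare function splitSource(
  fileName: string,
  source: string,
): { path: string[]; text: string }[];

describe("TreesitterSourceCodeSplitter", () => {
  const source = "export class A {\n  run(): void {}\n}\n";

  it("reconstructs the original file from concatenated chunks", () => {
    const chunks = splitSource("A.ts", source);
    expect(chunks.map((c) => c.text).join("")).toBe(source);
  });

  it("emits unique hierarchical paths", () => {
    const chunks = splitSource("A.ts", source);
    const keys = chunks.map((c) => c.path.join("/"));
    expect(new Set(keys).size).toBe(keys.length);
  });
});
```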
**Summary:**
- Structural nodes = skeleton for hierarchical reassembly.
- Content nodes = precise retrieval targets.
- Large bodies subdivided AFTER boundary identification without erasing the semantic anchor.
### 4. Content Processing Pipeline
```mermaid
sequenceDiagram
participant TSS as TreesitterSourceCodeSplitter
participant LP as LanguageParser
participant BE as BoundaryExtractor
participant TS as TextSplitter
TSS->>LP: Parse source code
LP->>BE: Extract semantic boundaries
BE->>TSS: Return boundary positions
loop For each boundary
TSS->>TSS: Extract boundary content
TSS->>TS: Delegate large content sections
TS->>TSS: Return subdivided chunks
end
TSS->>TSS: Assign hierarchical paths
TSS->>TSS: Create ContentChunk array
```
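Expressed as code, the loop above might look like this sketch (simplified: byte offsets are treated as string indices, and path assignment is flattened to a single level):
```typescript
// Simplified orchestration of the sequence diagram above.
type Boundary = { name?: string; startByte: number; endByte: number };
type ChunkLike = { path: string[]; text: string };

function splitSourceCode(
  fileName: string,
  source: string,
  extractBoundaries: (source: string) => Boundary[],
  delegate: (text: string) => string[], // TextSplitter delegation
  maxChunkSize: number,
): ChunkLike[] {
  const chunks: ChunkLike[] = [];
  for (const b of extractBoundaries(source)) {
    const text = source.slice(b.startByte, b.endByte);
    const path = b.name ? [fileName, b.name] : [fileName];
    if (new TextEncoder().encode(text).length <= maxChunkSize) {
      chunks.push({ path, text });
    } else {
      // Oversized content is subdivided while inheriting the parent path.
      delegate(text).forEach((part, i) =>
        chunks.push({
          path: [...path.slice(0, -1), `${path[path.length - 1]}#${i + 1}`],
          text: part,
        }),
      );
    }
  }
  return chunks;
}
```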
## Processing Flow
### Semantic Chunking Process
```mermaid
flowchart TD
A[Source Code Input] --> B{Language Supported?}
B -->|Yes| C[Tree-sitter Parsing]
B -->|No| D[TextDocumentSplitter Fallback]
C --> E[Boundary Extraction]
E --> F[Content Sectioning]
F --> G[Hierarchical Path Assignment]
G --> H[Chunk Generation]
D --> I[Line-based Chunks]
H --> J[ContentChunk Array]
I --> J
style A fill:#e1f5fe
style J fill:#e8f5e8
style D fill:#fff3e0
```
### Chunk Generation Strategy
The system creates chunks that balance semantic meaning with search effectiveness:
**Chunk Types:**
- **Structural Chunks**: Class/function/interface signatures with documentation
- **Method Chunks**: Complete method implementations including comments
- **Content Chunks**: Code sections between structural boundaries
- **Delegated Chunks**: Large content sections processed by TextSplitter
#### Dual Typing & Boundary Classification
Each emitted `ContentChunk` for source code now uses a dual typing strategy:
- `types` array always includes `code` (backward compatibility for existing retrieval / embedding logic).
- A secondary semantic classification is derived from boundary analysis:
- `boundaryType: "structural"` for declarations that introduce _named, nestable scopes_ (classes, interfaces, enums, namespaces/modules, type aliases, import/export units).
- `boundaryType: "content"` for executable or implementation-level units (functions, methods, constructors, arrow functions, variable/lexical declarations that carry behavior, etc.).
Design goals:
1. Hierarchical Assembly: `structural` nodes act as stable anchors for subtree reconstruction (e.g. return an entire class when a method matches).
2. Precision in Retrieval: `content` nodes map to the most specific executable region, improving ranking granularity.
3. Non-Destructive Enrichment: Existing consumers relying only on `types.includes("code")` remain unaffected.
4. Future Extensibility: Additional refinement layers (e.g. `interface-signature`, `public-api`, `test-code`) can be layered without breaking current contracts.
Why not merge structural + content nodes pre-index?
- Merging obscures which textual spans define navigational structure vs. executable implementation.
- Preserving both classifications enables context-dependent expansion policies (e.g. include whole class vs. just matched method).
Boundary integrity + dual typing replace the need for post-hoc greedy size normalization.
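A sketch of the resulting chunk shape (`types` and `boundaryType` follow the text above; the remaining field names are illustrative):
```typescript
interface SourceCodeChunk {
  types: string[]; // always includes "code" for backward compatibility
  boundaryType: "structural" | "content";
  path: string[];  // e.g. ["UserService.ts", "UserService", "getUser"]
  startLine: number;
  endLine: number;
  text: string;
}

// Consumers that only check `types.includes("code")` keep working;
// new expansion policies can branch on `boundaryType`.
```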
**Path Inheritance:**
Each chunk inherits the hierarchical path of its containing structure, enabling context-aware search and proper reassembly.
## Error Handling & Fallback Strategy
### Graceful Degradation
The splitter handles various scenarios through layered fallback mechanisms:
```mermaid
graph TD
A[Input Source Code] --> B{Language Supported?}
B -->|No| C[TextDocumentSplitter]
B -->|Yes| D{Tree-sitter Parse Success?}
D -->|No| C
D -->|Yes| E{Boundaries Extracted?}
E -->|No| C
E -->|Yes| F[Semantic Chunking]
F --> G[ContentChunk Array]
C --> H[Line-based Chunks]
G --> I[Output]
H --> I
style A fill:#e1f5fe
style I fill:#e8f5e8
style C fill:#fff3e0
```
**Fallback Scenarios:**
- **Unsupported Languages**: Automatic delegation to TextDocumentSplitter
- **Parse Errors**: Graceful fallback for malformed syntax
- **Boundary Detection Failures**: Line-based processing for complex edge cases
- **Large File Handling**: For files exceeding the underlying parser's universal 32KB limit, the system degrades gracefully: it performs a full semantic parse on the initial ~32KB and falls back to `TextDocumentSplitter` for the remainder of the file, ensuring that no content is lost.
### Error Recovery
The system maintains robust operation through the following mechanisms, sketched in code below:
- **Parse Error Isolation**: Errors in one section don't affect others
- **Content Preservation**: All source content is retained in chunks
- **Consistent Interface**: All fallback paths produce compatible ContentChunk arrays
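Roughly, the fallback ladder amounts to the following sketch (all names are stand-ins for the real components):
```typescript
type ChunkLike = { path: string[]; text: string };

function splitWithFallback(
  filePath: string,
  source: string,
  registry: { resolve(path: string): object | undefined },
  semanticSplit: (parser: object, source: string) => ChunkLike[],
  textDocumentSplit: (source: string) => ChunkLike[],
): ChunkLike[] {
  const parser = registry.resolve(filePath);
  if (!parser) return textDocumentSplit(source); // unsupported language
  try {
    const chunks = semanticSplit(parser, source);
    // Zero extracted boundaries also falls back to line-based chunks.
    return chunks.length > 0 ? chunks : textDocumentSplit(source);
  } catch {
    return textDocumentSplit(source); // parse failure
  }
}
```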
## Language Extensibility
### Current Language Support
The tree-sitter implementation provides comprehensive support for web development languages:
**JavaScript (ES6+):**
- Classes with methods and properties
- Functions (regular, async, and top-level arrow functions)
- JSX elements and React components
- Import/export statements
**TypeScript:**
- All JavaScript features plus type system constructs
- Interfaces and type alias definitions
- Namespaces and modules
- Enums and decorators
- Generic type parameters
- TSX (TypeScript + JSX)
**Python:**
- Classes, functions, and methods
- `import` and `from...import` statements
- Decorators and `async` functions
### Architecture for Extension
The modular design supports future language additions through:
**Parser Registry System:**
- Automatic language detection by file extension
- Fallback mechanisms for unsupported languages
- Consistent interface across all language parsers
**Tree-sitter Integration:**
- Leverages existing tree-sitter grammar ecosystem
- Language-specific parsers implement a common boundary extraction interface (sketched after this list)
- Shared infrastructure for syntax tree traversal and boundary detection
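A new language could plug in by implementing a shared contract along these lines (a sketch; the real interface may differ):
```typescript
interface SemanticBoundary {
  type: string;  // e.g. "class", "function", "struct"
  name?: string;
  startByte: number;
  endByte: number;
}

// Contract each language-specific parser would implement.
interface LanguageParser {
  readonly extensions: string[]; // e.g. [".go"]
  extractBoundaries(source: string): SemanticBoundary[];
}
```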
**Future Language Candidates:**
- Java (packages, classes, methods)
- C# (namespaces, classes, properties, methods)
- Go (packages, structs, methods, functions)
- Rust (modules, structs, impl blocks, functions)
## Integration with Pipeline System
### Source Code Processing Pipeline
The tree-sitter splitter integrates seamlessly with the existing content processing infrastructure:
```mermaid
graph LR
subgraph "Source Code Processing"
A[SourceCodePipeline] --> B[TreesitterSourceCodeSplitter]
B --> C[ContentChunk Array]
C --> D[Embedding Generation]
D --> E[Vector Storage]
end
style A fill:#e1f5fe
style C fill:#f3e5f5
style E fill:#e8f5e8
```
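The wiring might look like this sketch (method names are assumptions mirroring the diagram, not a confirmed API):
```typescript
interface Chunk { path: string[]; text: string }

async function runSourceCodePipeline(
  source: string,
  splitter: { split(source: string): Chunk[] },        // TreesitterSourceCodeSplitter
  embedder: { embed(texts: string[]): Promise<number[][]> },
  store: { upsert(chunks: Chunk[], vectors: number[][]): Promise<void> },
): Promise<void> {
  const chunks = splitter.split(source);               // semantic chunking
  const vectors = await embedder.embed(chunks.map((c) => c.text));
  await store.upsert(chunks, vectors);                 // vector storage
}
```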
### Rationale: Omission of Greedy Size-Based Merging
The previous design showed an additional GreedySplitter phase for size normalization. This has been intentionally removed for source code because:
- Structural Fidelity: Tree-sitter-derived chunks encode precise `{level, path}` hierarchies required for hierarchical reassembly and subtree reconstruction. Greedy merging collapses boundaries and degrades that signal.
- Retrieval Quality: HierarchicalAssemblyStrategy relies on intact structural units (class, method, function). Artificially merged aggregates reduce precision and introduce unrelated code into a match context.
- Reassembly Guarantees: The splitter already guarantees concatenability. Additional size-based merging provides negligible token efficiency gains compared to the semantic loss.
- JSON / Structured Parity: The same reasoning applies to JSON and other strictly nested formats—each structural node is meaningful even if its textual size is small.
- Simplicity & Predictability: A single semantic splitter reduces mental overhead, improves test determinism, and avoids edge cases where mixed-level metadata must be reconciled.
If future optimization is needed, it should be: (a) post-retrieval context window packing, or (b) language-aware micro-chunk collapsing that preserves explicit structural node boundaries—never generic greedy adjacency merging.
### System Benefits
**Enhanced Search Quality:**
- Semantic chunks respect code structure boundaries
- Hierarchical paths enable context-aware retrieval
- Documentation comments stay associated with relevant code
- Function and method retrieval maintains complete context
**Performance Characteristics:**
- Tree-sitter parsing provides linear time complexity
- Memory efficient processing without large intermediate structures
- Robust error handling prevents pipeline failures
- Maintains compatibility with existing chunk optimization systems
**Developer Experience:**
- Search results respect semantic boundaries
- Retrieved chunks include necessary surrounding context
- Hierarchical structure aids in understanding code relationships
- Consistent interface with other document processing pipelines