# 13. Plain Text (.txt) File Chunking

## Executive Summary

This spec proposes adding `.txt` file indexing support to mcp-markdown-ragdocs. Unlike markdown files, which carry header-based structure, plain text has no explicit semantic boundaries. Four chunking strategies were evaluated: paragraph-based (current fallback), sentence-based, semantic boundary detection, and sliding window. **Paragraph-based chunking** (Strategy 1) is selected as the optimal approach due to its simplicity (15 LOC, reuses existing logic), zero dependencies, and proven effectiveness in the current `_chunk_plain_text()` fallback. Implementation requires only a new parser, its registration, and configuration; no new chunking code is needed.

**Timeline:** 1-2 hours. **Risk:** Low (existing code path).

---

## 1. Goals & Non-Goals

### Goals
- Index `.txt` files with the same search quality as markdown
- Reuse existing chunking infrastructure (minimal code duplication)
- Maintain configuration consistency (same min/max/overlap parameters)
- Support all current features (semantic search, keyword search, graph traversal)

### Non-Goals
- Natural language processing (NLP) for sentence boundaries (avoid spaCy/NLTK dependencies)
- Language-specific tokenization (assume UTF-8 text)
- Custom chunking strategies per file extension (single strategy for all `.txt`)
- Retroactive re-chunking of existing markdown files

---

## 2. Current State Analysis

### 2.1. Architecture Overview

**Chunking Pipeline:**
```
Parser → Chunker → IndexManager → [VectorIndex, KeywordIndex, GraphStore]
```

**Key Components:**

| Component | File | Role |
|-----------|------|------|
| `ChunkingStrategy` | [src/chunking/base.py](../src/chunking/base.py) | Abstract protocol defining `chunk_document()` |
| `HeaderBasedChunker` | [src/chunking/header_chunker.py](../src/chunking/header_chunker.py) | Markdown-specific implementation (292 LOC) |
| `get_chunker()` | [src/chunking/factory.py](../src/chunking/factory.py) | Factory function returning strategy based on config |
| `IndexManager` | [src/indexing/manager.py](../src/indexing/manager.py) | Coordinates parsing, chunking, and index updates |
| `dispatch_parser()` | [src/parsers/dispatcher.py](../src/parsers/dispatcher.py) | Routes file extensions to parsers |

**Configuration:**
```python
@dataclass
class ChunkingConfig:
    strategy: str = "header_based"
    min_chunk_chars: int = 200
    max_chunk_chars: int = 2000
    overlap_chars: int = 100
    include_parent_headers: bool = True
    parent_retrieval_enabled: bool = False
    parent_chunk_min_chars: int = 1500
    parent_chunk_max_chars: int = 2000
```

### 2.2. Existing Plain Text Handling

**Fallback Behavior:**
`HeaderBasedChunker` already handles plain text via `_chunk_plain_text()` (lines 274-347):
- Splits at paragraph boundaries (`\n\n+`)
- Respects `min_chunk_chars` and `max_chunk_chars`
- Tracks `start_pos` and `end_pos` for each chunk
- Sets `header_path = ""` (no header hierarchy)

**Limitations:**
- Only triggered when markdown has **no headers**
- No explicit support for `.txt` files (dispatcher only registers markdown)

**Opportunity:**
`.txt` files can reuse `_chunk_plain_text()` with **zero changes to chunking code**. Only a parser and its registration are needed.

---

## 3. Alternative Chunking Strategies

### Strategy 1: Paragraph-Based (RECOMMENDED) ✅

**Description:**
Split text at blank lines (two or more consecutive newlines, `\n\n+`). Combine paragraphs until `max_chunk_chars` is reached.

**Implementation:**
Reuse existing `HeaderBasedChunker._chunk_plain_text()`:
```python
# src/chunking/header_chunker.py (lines 274-347)
paragraphs = re.split(r'\n\n+', content)
```
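To make the packing step concrete, here is a minimal sketch of greedy paragraph merging. It is not the project's `_chunk_plain_text()` implementation: `TextChunk` and `pack_paragraphs` are hypothetical names, and the sketch handles only `max_chars` (no minimum size, no overlap, no splitting of a single oversized paragraph).

```python
import re
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical stand-in for the project's Chunk model."""
    content: str
    start_pos: int
    end_pos: int


def pack_paragraphs(text: str, max_chars: int = 2000) -> list[TextChunk]:
    """Greedily merge consecutive paragraphs while the merged span fits max_chars."""
    # Collect the character span of every paragraph between blank-line separators.
    spans: list[tuple[int, int]] = []
    cursor = 0
    for sep in re.finditer(r"\n\n+", text):
        if sep.start() > cursor:
            spans.append((cursor, sep.start()))
        cursor = sep.end()
    if cursor < len(text):
        spans.append((cursor, len(text)))

    chunks: list[TextChunk] = []
    for start, end in spans:
        if chunks and end - chunks[-1].start_pos <= max_chars:
            # Extend the previous chunk; slicing the source keeps offsets exact.
            prev = chunks[-1]
            chunks[-1] = TextChunk(text[prev.start_pos:end], prev.start_pos, end)
        else:
            # Start a new chunk (an oversized paragraph is kept whole here).
            chunks.append(TextChunk(text[start:end], start, end))
    return chunks
```

Slicing the original string rather than re-joining paragraphs keeps `start_pos` and `end_pos` consistent with the source file, which matters if positions are later used for highlighting or deduplication.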

**Pros:**
- ✅ **Zero new code** (reuses existing logic)
- ✅ Natural semantic boundaries (paragraphs = coherent units)
- ✅ Respects size constraints (min/max/overlap)
- ✅ No dependencies (stdlib `re` module)
- ✅ Already tested via markdown fallback path

**Cons:**
- ⚠️ Poor for dense text without paragraphs (e.g., logs)
- ⚠️ Fails on non-paragraph structured text (lists, tables)

**Search Quality Impact:** **High** (proven in production for headerless markdown)

**Implementation Complexity:** **15 LOC** (parser + config only)

**Impact/Cost Ratio:** **5.0** (best)

---

### Strategy 2: Sentence-Based

**Description:**
Split text into sentences using regex or NLP library (spaCy/NLTK). Group sentences into chunks.

**Implementation:**
```python
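def _chunk_by_sentences(self, document: Document) -> list[str]:
    # Sketch: returns raw chunk strings; building Chunk objects is omitted.
    # Naive sentence boundary detection: split on whitespace after ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', document.content)

    chunks = []
    current = ""
    for sent in sentences:
        candidate = f"{current} {sent}" if current else sent
        if len(candidate) <= self.config.max_chunk_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sent
    if current:
        chunks.append(current)  # Flush the final chunk
    return chunks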
```

**Pros:**
- ✅ Finer-grained chunks (better for QA)
- ✅ Natural linguistic boundaries

**Cons:**
- ❌ Regex sentence splitting unreliable (abbreviations, URLs, code)
- ❌ NLP libraries add heavy dependencies (spaCy: 100MB+)
- ❌ Slower processing (tokenization overhead)
- ❌ Poor for non-prose text (logs, config files)
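
The unreliability of regex splitting is easy to reproduce. The snippet below applies the same lookbehind pattern to a sample sentence (the sample text and the helper name `split_sentences` are illustrative only) and shows abbreviations being treated as sentence ends.

```python
import re

# Same boundary rule as the strategy sketch: whitespace preceded by ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text wherever the naive boundary rule fires."""
    return SENTENCE_BOUNDARY.split(text)


sample = "Dr. Smith arrived at 5 p.m. today. See config.yaml for details."
parts = split_sentences(sample)
# The sample holds two sentences, but the rule also splits after "Dr." and "p.m."
for part in parts:
    print(repr(part))
```

The dot inside `config.yaml` survives only because no whitespace follows it; any abbreviation followed by a space is split.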

**Search Quality Impact:** **Medium** (marginal improvement over paragraphs)

**Implementation Complexity:** **80 LOC** (sentence detection + chunking)

**Impact/Cost Ratio:** **0.5**

---

### Strategy 3: Semantic Boundary Detection

**Description:**
Use embeddings to detect topic shifts. Split when cosine similarity between adjacent paragraphs drops below threshold.

**Implementation:**
```python
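def _chunk_by_semantic_boundaries(self, document: Document) -> list[str]:
    # Sketch: returns raw chunk strings. self.model.embed() and
    # cosine_similarity() are assumed helpers, not existing project APIs.
    paragraphs = [p for p in re.split(r'\n\n+', document.content) if p.strip()]
    if not paragraphs:
        return []
    embeddings = [self.model.embed(p) for p in paragraphs]

    chunks = []
    current_chunk = [paragraphs[0]]

    for i in range(1, len(paragraphs)):
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])
        if similarity < 0.7:  # Topic shift threshold
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [paragraphs[i]]
        else:
            current_chunk.append(paragraphs[i])

    chunks.append("\n\n".join(current_chunk))  # Flush the final group
    return chunks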
```
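
The snippet above calls a `cosine_similarity` helper that the spec does not define. A dependency-free version might look like the following sketch; treating zero vectors as similarity 0.0 is an assumption made here, not something the spec prescribes.

```python
import math
from collections.abc import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

In practice the embedding library in use may already ship a vectorised equivalent, in which case that should be preferred.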

**Pros:**
- ✅ Semantically coherent chunks
- ✅ Adapts to content structure

**Cons:**
- ❌ Requires embedding model at index time (2x indexing time)
- ❌ Threshold tuning needed per corpus
- ❌ Poor for small documents (not enough paragraphs)
- ❌ Computationally expensive (O(n) embeddings per document)

**Search Quality Impact:** **Highest** (optimal coherence)

**Implementation Complexity:** **150 LOC** (embedding pipeline + boundary detection)

**Impact/Cost Ratio:** **0.3**

---

### Strategy 4: Sliding Window

**Description:**
Fixed-size chunks with configurable overlap. No semantic awareness.

**Implementation:**
```python
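def _chunk_by_sliding_window(self, document: Document) -> list[str]:
    # Sketch: returns raw chunk strings; building Chunk objects is omitted.
    content = document.content
    chunk_size = self.config.max_chunk_chars
    overlap = self.config.overlap_chars
    if overlap >= chunk_size:
        # A step of zero or less would never advance and loop forever
        raise ValueError("overlap_chars must be smaller than max_chunk_chars")

    chunks = []
    start = 0
    while start < len(content):
        end = start + chunk_size
        chunks.append(content[start:end])  # Slicing clamps end to len(content)
        start += chunk_size - overlap

    return chunks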
```
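
Because the window advances by `chunk_size - overlap` characters per step, the number of chunks is easy to predict. The helper below (a hypothetical name, shown only to make the arithmetic concrete) lists the start offsets the loop would visit.

```python
def sliding_window_starts(length: int, chunk_size: int, overlap: int) -> list[int]:
    """Return the start offset of every window over a text of the given length."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return list(range(0, length, step))


# With the default config (max_chunk_chars=2000, overlap_chars=100) the step is
# 1900 characters, so a 10,000-character file yields six windows.
starts = sliding_window_starts(10_000, chunk_size=2000, overlap=100)
print(starts)  # [0, 1900, 3800, 5700, 7600, 9500]
```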

**Pros:**
- ✅ Simplest implementation (20 LOC)
- ✅ Predictable chunk sizes
- ✅ No dependencies

**Cons:**
- ❌ Breaks mid-sentence (poor readability)
- ❌ Ignores paragraph boundaries (low coherence)
- ❌ Suboptimal for search (random splits)

**Search Quality Impact:** **Low** (degrades relevance)

**Implementation Complexity:** **20 LOC**

**Impact/Cost Ratio:** **0.2**

---

## 4. Decision Matrix

| Strategy | Search Quality | Complexity (LOC) | Dependencies | Speed | Impact/Cost |
|----------|----------------|------------------|--------------|-------|-------------|
| **1. Paragraph-Based** ✅ | ⭐⭐⭐⭐ High | **15** | None | Fast | **5.0** |
| 2. Sentence-Based | ⭐⭐⭐ Medium | 80 | spaCy/NLTK | Slow | 0.5 |
| 3. Semantic Boundaries | ⭐⭐⭐⭐⭐ Highest | 150 | Embedding model | Very Slow | 0.3 |
| 4. Sliding Window | ⭐⭐ Low | 20 | None | Fast | 0.2 |

**Selected Strategy:** **Paragraph-Based** (Strategy 1)

**Rationale:**
1. **Proven reliability:** Already deployed in `_chunk_plain_text()` fallback
2. **Minimal risk:** Zero new chunking logic (only parser registration)
3. **Cost efficiency:** 15 LOC vs. 80-150 LOC for alternatives
4. **No dependencies:** Stdlib only (no spaCy/NLTK overhead)
5. **User preference alignment:** "simplicity, real implementation over complexity"

---

## 5. Proposed Solution

### 5.1. High-Level Design

**Approach:** Register a `PlainTextParser` that outputs `Document` objects, then reuse `HeaderBasedChunker._chunk_plain_text()` via existing fallback path.

**Key Insight:** `.txt` files are semantically equivalent to "markdown with no headers"—the chunking logic is identical.

### 5.2. Component Changes

#### 5.2.1. New Parser (NEW)

```python
# src/parsers/plaintext.py (~30 LOC)
from pathlib import Path
from datetime import datetime, timezone
from src.models import Document
from src.parsers.base import DocumentParser


class PlainTextParser(DocumentParser):
    """Parser for plain text (.txt) files."""

    def parse(self, file_path: str) -> Document:
        path = Path(file_path)

        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")

        # Try UTF-8 first, then single-byte fallbacks. latin-1 goes last
        # because it maps every byte and therefore never fails.
        content = None
        for encoding in ["utf-8", "cp1252", "latin-1"]:
            try:
                content = path.read_text(encoding=encoding, errors="strict")
                break
            except (UnicodeDecodeError, LookupError):
                continue

        if content is None:
            raise UnicodeDecodeError("utf-8", b"", 0, 1, f"Cannot decode {file_path}")

        modified_time = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)

        return Document(
            content=content,
            metadata={"source": str(path)},
            links=[],  # No link extraction for plain text
            tags=[],   # No tag extraction
            file_path=str(path),
            modified_time=modified_time,
        )
```

**Design Notes:**
- Follows `MarkdownParser` encoding fallback pattern (lines 34-48)
- No frontmatter parsing (plain text has no YAML headers)
- No link extraction (no markdown syntax)
- Minimal metadata (source file path only)
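
One detail worth checking in any encoding fallback list is the order. In Python, `latin-1` (of which `iso-8859-1` is an alias) maps all 256 byte values, so it never raises and any entry placed after it is unreachable. The sketch below assumes the order UTF-8, then cp1252, then latin-1; the function name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(
    data: bytes,
    encodings: tuple[str, ...] = ("utf-8", "cp1252", "latin-1"),
) -> tuple[str, str]:
    """Decode bytes with the first encoding that accepts them.

    Returns the decoded text together with the encoding that succeeded.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the list, kept for custom encoding tuples.
    raise ValueError("no candidate encoding could decode the data")


print(decode_with_fallback("Café".encode("utf-8")))  # valid UTF-8
print(decode_with_fallback(b"Caf\xe9"))              # invalid UTF-8, valid cp1252
print(decode_with_fallback(b"\x81"))                 # undefined in cp1252, latin-1 accepts it
```

A successful decode does not prove the guess was right: single-byte encodings accept almost anything, so logging the encoding used (as the appendix version does) helps diagnose mojibake later.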

#### 5.2.2. Parser Registration

```python
# src/parsers/__init__.py
from src.parsers.markdown import MarkdownParser
from src.parsers.plaintext import PlainTextParser

__all__ = ["MarkdownParser", "PlainTextParser"]
```

```python
# src/parsers/dispatcher.py (modify dispatch_parser function)
from pathlib import PurePosixPath

from src.config import Config
from src.parsers.markdown import MarkdownParser
from src.parsers.plaintext import PlainTextParser

def dispatch_parser(file_path: str, config: Config):
    path = PurePosixPath(file_path)

    for pattern, parser_name in config.parsers.items():
        # Match the pattern as written, then retry without the leading "**/"
        # so that top-level files such as "doc.txt" also match "**/*.txt"
        if path.match(pattern) or (pattern.startswith("**/") and path.match(pattern.replace("**/", ""))):
            if parser_name == "MarkdownParser":
                return MarkdownParser()
            elif parser_name == "PlainTextParser":
                return PlainTextParser()

    raise ValueError(f"No parser registered for file: {file_path}")
```
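
The second half of the match condition exists because `PurePath.match()` does not treat `**` as a recursive wildcard; it behaves like a single `*` segment, so `**/*.txt` needs at least one parent directory. The standalone sketch below reproduces that behaviour with the standard library only; the helper name `matches` is hypothetical.

```python
from pathlib import PurePosixPath


def matches(file_path: str, pattern: str) -> bool:
    """Match a path against a glob, retrying without a leading '**/'."""
    path = PurePosixPath(file_path)
    if path.match(pattern):
        return True
    if pattern.startswith("**/"):
        # Lets top-level files satisfy patterns written as "**/*.ext".
        return path.match(pattern[len("**/"):])
    return False


print(PurePosixPath("doc.txt").match("**/*.txt"))        # False: pattern has two segments
print(PurePosixPath("notes/doc.txt").match("**/*.txt"))  # True
print(matches("doc.txt", "**/*.txt"))                    # True via the stripped pattern
```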

#### 5.2.3. Configuration

```toml
# config.toml (add to [parsers] section)
[parsers]
"**/*.md" = "MarkdownParser"
"**/*.txt" = "PlainTextParser"  # NEW
```

**Rebuild Trigger:**
Adding `"**/*.txt"` to parsers config will trigger index rebuild (manifest version mismatch).

#### 5.2.4. Chunking Integration

**No code changes needed.** `HeaderBasedChunker.chunk_document()` already handles documents without headers:

```python
# src/chunking/header_chunker.py (lines 28-44, EXISTING CODE)
def chunk_document(self, document: Document) -> list[Chunk]:
    content_bytes = bytes(document.content, "utf8")
    tree = self.parser.parse(content_bytes)  # Parses as markdown
    root_node = tree.root_node

    headers = self._extract_headers(root_node, content_bytes)

    if not headers:  # ← Plain text triggers this path
        return self._chunk_plain_text(document)  # ← Reuses existing logic

    # ... markdown chunking ...
```

**Behavior:**
1. `PlainTextParser` returns `Document` with plain text content
2. `HeaderBasedChunker` attempts to parse as markdown (tree-sitter)
3. No headers found (plain text is valid markdown with no headers)
4. Falls back to `_chunk_plain_text()` (paragraph-based chunking)
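
Step 3 holds only when the file contains nothing a markdown grammar reads as a heading. Plain text sometimes does: under CommonMark, a line starting with `# ` is an ATX heading, and a title underlined with `=====` or `-----` is a setext heading. Such files would take the header-based path instead of the fallback. The check below is a rough regex approximation that could be used in tests to flag them; it is not tree-sitter, and the function name is hypothetical.

```python
import re

# ATX heading: up to three spaces, 1-6 '#' characters, then whitespace or end of line.
ATX_HEADING = re.compile(r"^ {0,3}#{1,6}(?:[ \t]+\S|[ \t]*$)", re.MULTILINE)
# Setext heading: a non-blank line followed by a line made only of '=' or '-'.
SETEXT_UNDERLINE = re.compile(r"^[^\n]*\S[^\n]*\n {0,3}(?:=+|-+)[ \t]*$", re.MULTILINE)


def looks_like_markdown_headers(text: str) -> bool:
    """Heuristically report whether plain text contains heading-like lines."""
    return bool(ATX_HEADING.search(text) or SETEXT_UNDERLINE.search(text))


print(looks_like_markdown_headers("Hello world\n\nSecond paragraph"))    # False
print(looks_like_markdown_headers("Release Notes\n=============\n\nbody"))  # True
```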

**Validation:**
Plain text follows identical chunking path as headerless markdown—this code path is already tested and deployed.

---

## 6. Implementation Plan

### Phase 1: Parser Implementation (30 min)

**Tasks:**
- [ ] Create `src/parsers/plaintext.py`
- [ ] Implement `PlainTextParser.parse()`
- [ ] Add UTF-8 + fallback encoding support
- [ ] Add to `src/parsers/__init__.py` exports

**Acceptance Criteria:**
- `PlainTextParser().parse("test.txt")` returns valid `Document`
- Handles UTF-8 and latin-1 encodings
- Modified time preserved from file metadata

### Phase 2: Parser Registration (15 min)

**Tasks:**
- [ ] Modify `src/parsers/dispatcher.py` to handle parser name lookup
- [ ] Add `PlainTextParser` import and dispatch logic
- [ ] Update default config template with `"**/*.txt"` pattern

**Acceptance Criteria:**
- `dispatch_parser("doc.txt", config)` returns `PlainTextParser`
- `dispatch_parser("doc.md", config)` still returns `MarkdownParser`

### Phase 3: Testing (30 min)

**Tasks:**
- [ ] Unit test: `test_plaintext_parser()` (parse, encoding, errors)
- [ ] Integration test: `test_txt_file_indexing()` (end-to-end)
- [ ] Test chunk sizes match config (min/max/overlap)
- [ ] Test search retrieval from `.txt` chunks

**Test Cases:**

| Test | Input | Expected Output |
|------|-------|-----------------|
| `test_parse_utf8` | `hello.txt` (UTF-8) | `Document(content="hello world", ...)` |
| `test_parse_latin1` | `legacy.txt` (latin-1) | `Document(...)` with fallback encoding |
| `test_chunk_paragraphs` | Multi-paragraph `.txt` | Chunks split at `\n\n`, sizes within bounds |
| `test_search_txt_chunks` | Query + indexed `.txt` | Returns relevant chunks with scores |

### Phase 4: Documentation (15 min)

**Tasks:**
- [ ] Update `docs/configuration.md` with `.txt` parser example
- [ ] Add `.txt` support to `README.md` features list
- [ ] Document encoding handling in `docs/development.md`

---

## 7. Testing Strategy

### 7.1. Unit Tests

**File:** `tests/unit/test_plaintext_parser.py` (~50 LOC)

```python
import pytest
from pathlib import Path
from src.parsers.plaintext import PlainTextParser

def test_parse_simple_text(tmp_path):
    file = tmp_path / "test.txt"
    file.write_text("Hello world\n\nSecond paragraph")

    parser = PlainTextParser()
    doc = parser.parse(str(file))

    assert doc.content == "Hello world\n\nSecond paragraph"
    assert doc.metadata["source"] == str(file)
    assert doc.links == []
    assert doc.tags == []

def test_parse_encoding_fallback(tmp_path):
    file = tmp_path / "latin.txt"
    file.write_bytes(b"Caf\xe9")  # "Café" in latin-1/cp1252; invalid as UTF-8

    parser = PlainTextParser()
    doc = parser.parse(str(file))

    assert doc.content == "Café"  # Decoded via a fallback encoding

def test_parse_nonexistent_file():
    parser = PlainTextParser()
    with pytest.raises(FileNotFoundError):
        parser.parse("/nonexistent/file.txt")
```

### 7.2. Integration Tests

**File:** `tests/integration/test_txt_indexing.py` (~80 LOC)

```python
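from src.chunking.header_chunker import HeaderBasedChunker
from src.parsers.plaintext import PlainTextParser


# index_manager and config are assumed to be fixtures provided by conftest.py
def test_index_and_search_txt_file(tmp_path, index_manager, config):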
    # Create test .txt file
    txt_file = tmp_path / "notes.txt"
    txt_file.write_text(
        "Machine Learning Notes\n\n"
        "Neural networks are computational models.\n\n"
        "Training requires labeled data."
    )

    # Index document
    index_manager.index_document(str(txt_file))

    # Query
    results = index_manager.vector.search("neural networks", top_k=5)

    # Verify chunk retrieval
    assert len(results) > 0
    chunk = results[0]
    assert "neural networks" in chunk.content.lower()
    assert chunk.header_path == ""  # No headers in plain text

def test_txt_chunking_respects_size_limits(tmp_path, config):
    # Create large plain text file
    txt_file = tmp_path / "large.txt"
    paragraphs = [f"Paragraph {i} content." * 50 for i in range(100)]
    txt_file.write_text("\n\n".join(paragraphs))

    # Parse and chunk
    parser = PlainTextParser()
    doc = parser.parse(str(txt_file))

    chunker = HeaderBasedChunker(config.chunking)
    chunks = chunker.chunk_document(doc)

    # Verify constraints
    for chunk in chunks:
        assert len(chunk.content) >= config.chunking.min_chunk_chars
        assert len(chunk.content) <= config.chunking.max_chunk_chars
```

### 7.3. Performance Validation

**Benchmark:** Index 100 `.txt` files (1-10KB each), measure:
- Indexing time (should match markdown ±10%)
- Query latency (no degradation)
- Memory usage (no significant increase)
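
A small timing harness is enough for the indexing comparison. The sketch below is generic: `index_fn` stands for whatever callable indexes one file (for example a bound `index_document` method), and both function names are hypothetical.

```python
import time
from collections.abc import Callable, Iterable
from statistics import mean


def time_indexing(index_fn: Callable[[str], object], paths: Iterable[str]) -> dict[str, float]:
    """Call index_fn once per path and report wall-clock timings in seconds."""
    durations: list[float] = []
    for path in paths:
        started = time.perf_counter()
        index_fn(path)
        durations.append(time.perf_counter() - started)
    return {
        "files": float(len(durations)),
        "total_s": sum(durations),
        "mean_s": mean(durations) if durations else 0.0,
    }


def within_tolerance(candidate_s: float, baseline_s: float, tolerance: float = 0.10) -> bool:
    """Check that a timing deviates from the baseline by at most the given fraction."""
    return abs(candidate_s - baseline_s) <= tolerance * baseline_s
```

Running the harness once over the markdown corpus and once over the `.txt` corpus, then comparing the `mean_s` values with `within_tolerance`, gives a direct check of the ±10% target. Repeating each run a few times reduces noise from disk caching.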

---

## 8. Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Encoding issues with non-UTF-8 files | Medium | Medium | Fallback encoding sequence (UTF-8 first, then single-byte encodings such as cp1252 and latin-1) |
| Poor chunking for dense text (logs) | Low | Low | Document as limitation; consider sliding window in future |
| Index rebuild required | High | Low | Automatic on config change (existing mechanism) |
| User confusion (why rebuild?) | Medium | Low | Log clear message: "New parser added, rebuilding index..." |

---

## 9. File Manifest

### Files to Create
- [x] `src/parsers/plaintext.py` (~30 LOC)
- [x] `tests/unit/test_plaintext_parser.py` (~50 LOC)
- [x] `tests/integration/test_txt_indexing.py` (~80 LOC)

### Files to Modify
- [x] `src/parsers/__init__.py` (+2 LOC: import + export)
- [x] `src/parsers/dispatcher.py` (+8 LOC: parser name dispatch)
- [x] `config.toml` (+1 LOC: `.txt` pattern)
- [x] `docs/configuration.md` (+15 LOC: parser example)
- [x] `docs/development.md` (+10 LOC: encoding notes)
- [x] `README.md` (+1 LOC: features list)

**Total LOC:** ~200 (about 130 tests, 40 production code and config, 25 documentation)

---

## 10. Future Enhancements (Out of Scope)

### 10.1. File Type-Specific Strategies
Allow different chunking strategies per file type:
```toml
[chunking.strategies]
"**/*.md" = "header_based"
"**/*.txt" = "paragraph_based"
"**/*.log" = "sliding_window"
```

**Complexity:** High (requires refactoring `get_chunker()` and config schema)

**Value:** Medium (most `.txt` files are prose; logs are rare)

### 10.2. Sentence-Based Chunking
Add optional sentence-level chunking for QA workloads:
```toml
[chunking]
strategy = "sentence_based"  # Requires spaCy
```

**Complexity:** High (80 LOC + spaCy dependency)

**Value:** Low (marginal search quality improvement)

### 10.3. Metadata Extraction
Extract dates, URLs, or email addresses from plain text:
```python
metadata = {
    "dates": ["2025-01-08"],
    "urls": ["https://example.com"],
    "emails": ["user@example.com"]
}
```

**Complexity:** Medium (regex patterns + extraction logic)

**Value:** Medium (useful for certain corpora, e.g., logs or emails)
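
A minimal sketch of what such extraction could look like with the standard `re` module follows. The patterns are deliberately simple assumptions (ISO dates only, no validation of URLs or addresses), and `extract_metadata` is a hypothetical name.

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")          # ISO-style dates only
URL_RE = re.compile(r"https?://[^\s<>\"']+")            # stops at whitespace or quotes
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_metadata(text: str) -> dict[str, list[str]]:
    """Collect date, URL, and email strings found in plain text."""
    return {
        "dates": DATE_RE.findall(text),
        "urls": URL_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


sample = "Deployed 2025-01-08. Docs at https://example.com and contact user@example.com for help."
print(extract_metadata(sample))
```

Real corpora would need broader date formats and trimming of trailing punctuation from URLs, which is part of why this is listed as medium complexity.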

---

## 11. Success Metrics

### 11.1. Functional Metrics
- [ ] `.txt` files indexed without errors
- [ ] Chunk sizes within configured bounds (200-2000 chars)
- [ ] Search queries return relevant `.txt` chunks
- [ ] Encoding fallback handles non-UTF-8 files

### 11.2. Performance Metrics
- [ ] Indexing time for `.txt` matches markdown (±10%)
- [ ] Query latency unchanged (<5ms regression)
- [ ] Memory usage stable (<5% increase)

### 11.3. Quality Metrics
- [ ] Manual review: top-3 results for 10 queries include relevant `.txt` chunks
- [ ] No chunk boundary splits mid-sentence (manual inspection of 20 random chunks)

---

## 12. Appendix: Code Examples

### A. Complete PlainTextParser Implementation

```python
# src/parsers/plaintext.py
import logging
from datetime import datetime, timezone
from pathlib import Path

from src.models import Document
from src.parsers.base import DocumentParser

logger = logging.getLogger(__name__)


class PlainTextParser(DocumentParser):
    """
    Parser for plain text (.txt) files.

    Supports UTF-8 and common single-byte fallback encodings (cp1252, latin-1).
    Returns Document with minimal metadata (no links, tags, or frontmatter).
    """

    def parse(self, file_path: str) -> Document:
        path = Path(file_path)

        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")

        # Try UTF-8 first, then single-byte fallbacks. latin-1 goes last
        # because it maps every byte and therefore never fails.
        content = None
        encoding_used = None
        for encoding in ["utf-8", "cp1252", "latin-1"]:
            try:
                content = path.read_text(encoding=encoding, errors="strict")
                encoding_used = encoding
                break
            except (UnicodeDecodeError, LookupError):
                continue

        if content is None:
            raise UnicodeDecodeError(
                "utf-8", b"", 0, 1,
                f"Could not decode {file_path} with any supported encoding"
            )

        if encoding_used != "utf-8":
            logger.warning(f"File {file_path} decoded with {encoding_used} encoding")

        modified_time = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)

        return Document(
            content=content,
            metadata={
                "source": str(path),
                "encoding": encoding_used,
            },
            links=[],  # No link extraction for plain text
            tags=[],   # No frontmatter tags
            file_path=str(path),
            modified_time=modified_time,
        )
```

### B. Updated Dispatcher

```python
# src/parsers/dispatcher.py
from pathlib import PurePosixPath

from src.config import Config
from src.parsers.markdown import MarkdownParser
from src.parsers.plaintext import PlainTextParser


def dispatch_parser(file_path: str, config: Config):
    """
    Route file to appropriate parser based on config patterns.

    Supports:
    - MarkdownParser: **/*.md
    - PlainTextParser: **/*.txt
    """
    path = PurePosixPath(file_path)

    for pattern, parser_name in config.parsers.items():

        # Try direct match
        try:
            if path.match(pattern):
                return _instantiate_parser(parser_name, file_path)
        except ValueError:
            pass

        # Try simplified pattern (strip **/)
        if pattern.startswith("**/"):
            simple_pattern = pattern.replace("**/", "")
            try:
                if path.match(simple_pattern):
                    return _instantiate_parser(parser_name, file_path)
            except ValueError:
                pass

    raise ValueError(f"No parser registered for file: {file_path}")


def _instantiate_parser(parser_name: str, file_path: str):
    """Factory function to create parser instances."""
    if parser_name == "MarkdownParser":
        return MarkdownParser()
    elif parser_name == "PlainTextParser":
        return PlainTextParser()
    else:
        raise ValueError(f"Unknown parser: {parser_name} for {file_path}")
```

---

## 13. References

### Internal Documents
- [specs/08-document-chunking.md](./08-document-chunking.md) - Chunking architecture
- [docs/development.md](../docs/development.md) - Adding new parsers
- [src/chunking/header_chunker.py](../src/chunking/header_chunker.py) - Existing chunking logic

### External Resources
- Python `re` module: https://docs.python.org/3/library/re.html
- Unicode encodings: https://docs.python.org/3/howto/unicode.html
- tree-sitter markdown: https://github.com/tree-sitter-grammars/tree-sitter-markdown

---

**Document Version:** 1.0
**Author:** GitHub Copilot (Planner Mode)
**Last Updated:** 2026-01-08
**Status:** Ready for Implementation

```python
# src/parsers/plaintext.py (~30 LOC)
from pathlib import Path
from datetime import datetime, timezone
from src.models import Document
from src.parsers.base import DocumentParser


class PlainTextParser(DocumentParser):
    """Parser for plain text (.txt) files."""

    def parse(self, file_path: str) -> Document:
        path = Path(file_path)

        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")

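        # Try UTF-8 first, then fall back to legacy single-byte encodings.
        # Note: latin-1 (alias iso-8859-1) decodes any byte sequence, so the
        # entries after it are never reached and the None check is a safeguard.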
        content = None
        for encoding in ["utf-8", "latin-1", "cp1252", "iso-8859-1"]:
            try:
                content = path.read_text(encoding=encoding, errors="strict")
                break
            except (UnicodeDecodeError, LookupError):
                continue

        if content is None:
            raise UnicodeDecodeError("utf-8", b"", 0, 1, f"Cannot decode {file_path}")

        modified_time = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)

        return Document(
            content=content,
            metadata={"source": str(path)},
            links=[],  # No link extraction for plain text
            tags=[],   # No tag extraction
            file_path=str(path),
            modified_time=modified_time,
        )
```

**Design Notes:**
- Follows `MarkdownParser` encoding fallback pattern (lines 34-48)
- No frontmatter parsing (plain text has no YAML headers)
- No link extraction (no markdown syntax)
- Minimal metadata (source file path only)
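
One detail of the fallback order is worth checking: `latin-1` assigns a character to every byte value, so it never raises `UnicodeDecodeError`, whereas `cp1252` leaves a few bytes (such as `0x81`) undefined. A short sketch of that behavior, using a hypothetical `decode_with_fallback` helper that is not part of the codebase:

```python
def decode_with_fallback(data: bytes, encodings: list[str]) -> tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes `data`."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


# 0xE9 is not valid UTF-8 here, so decoding falls through to the next entry.
text, used = decode_with_fallback(b"Caf\xe9", ["utf-8", "latin-1", "cp1252"])
print(text, used)  # Café latin-1

# Listing the stricter cp1252 before the catch-all gives it a chance to run.
print(decode_with_fallback(b"Caf\xe9", ["utf-8", "cp1252", "latin-1"])[1])  # cp1252
```

If distinguishing Windows-1252 text matters for a corpus, the order of the candidate list is the knob to adjust; otherwise the effective behavior is "UTF-8, else latin-1".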

#### 5.2.2. Parser Registration

```python
# src/parsers/__init__.py
from src.parsers.markdown import MarkdownParser
from src.parsers.plaintext import PlainTextParser

__all__ = ["MarkdownParser", "PlainTextParser"]
```

```python
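# src/parsers/dispatcher.py (modify dispatch_parser function)
from pathlib import PurePosixPath

from src.config import Config
from src.parsers.markdown import MarkdownParser
from src.parsers.plaintext import PlainTextParser

def dispatch_parser(file_path: str, config: Config):
    path = PurePosixPath(file_path)

    # Patterns are checked in config order; the first match wins
    for pattern, parser_name in config.parsers.items():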

        # Match against pattern
        if path.match(pattern) or (pattern.startswith("**/") and path.match(pattern.replace("**/", ""))):
            if parser_name == "MarkdownParser":
                return MarkdownParser()
            elif parser_name == "PlainTextParser":
                return PlainTextParser()

    raise ValueError(f"No parser registered for file: {file_path}")
```
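
The second half of the match condition exists because `PurePosixPath.match()` compares path components from the right and treats `**` like a single-component wildcard, so a bare file name such as `notes.txt` does not match `**/*.txt` on its own. A minimal sketch of the same check as a standalone helper (`matches_pattern` is a hypothetical name, not a function in the codebase):

```python
from pathlib import PurePosixPath


def matches_pattern(file_path: str, pattern: str) -> bool:
    """Mirror the dispatcher's check: direct match, then match without '**/'."""
    path = PurePosixPath(file_path)
    if path.match(pattern):
        return True
    if pattern.startswith("**/"):
        return path.match(pattern.replace("**/", ""))
    return False


print(PurePosixPath("notes.txt").match("**/*.txt"))   # False: pattern has more components
print(matches_pattern("notes.txt", "**/*.txt"))       # True via the stripped pattern
print(matches_pattern("docs/notes.txt", "**/*.txt"))  # True via the direct match
print(matches_pattern("docs/notes.md", "**/*.txt"))   # False
```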

#### 5.2.3. Configuration

```toml
# config.toml (add to [parsers] section)
[parsers]
"**/*.md" = "MarkdownParser"
"**/*.txt" = "PlainTextParser"  # NEW
```

**Rebuild Trigger:**
Adding `"**/*.txt"` to the parsers config triggers an index rebuild (manifest version mismatch).

#### 5.2.4. Chunking Integration

**No code changes needed.** `HeaderBasedChunker.chunk_document()` already handles documents without headers:

```python
# src/chunking/header_chunker.py (lines 28-44, EXISTING CODE)
def chunk_document(self, document: Document) -> list[Chunk]:
    content_bytes = bytes(document.content, "utf8")
    tree = self.parser.parse(content_bytes)  # Parses as markdown
    root_node = tree.root_node

    headers = self._extract_headers(root_node, content_bytes)

    if not headers:  # ← Plain text triggers this path
        return self._chunk_plain_text(document)  # ← Reuses existing logic

    # ... markdown chunking ...
```

**Behavior:**
1. `PlainTextParser` returns `Document` with plain text content
2. `HeaderBasedChunker` attempts to parse as markdown (tree-sitter)
3. No headers found (plain text is valid markdown with no headers)
4. Falls back to `_chunk_plain_text()` (paragraph-based chunking)

**Validation:**
Plain text follows the same chunking path as headerless markdown, and that code path is already tested and deployed.

---

## 6. Implementation Plan

### Phase 1: Parser Implementation (30 min)

**Tasks:**
- [ ] Create `src/parsers/plaintext.py`
- [ ] Implement `PlainTextParser.parse()`
- [ ] Add UTF-8 + fallback encoding support
- [ ] Add to `src/parsers/__init__.py` exports

**Acceptance Criteria:**
- `PlainTextParser().parse("test.txt")` returns valid `Document`
- Handles UTF-8 and latin-1 encodings
- Modified time preserved from file metadata

### Phase 2: Parser Registration (15 min)

**Tasks:**
- [ ] Modify `src/parsers/dispatcher.py` to handle parser name lookup
- [ ] Add `PlainTextParser` import and dispatch logic
- [ ] Update default config template with `"**/*.txt"` pattern

**Acceptance Criteria:**
- `dispatch_parser("doc.txt", config)` returns `PlainTextParser`
- `dispatch_parser("doc.md", config)` still returns `MarkdownParser`

### Phase 3: Testing (30 min)

**Tasks:**
- [ ] Unit test: `test_plaintext_parser()` (parse, encoding, errors)
- [ ] Integration test: `test_txt_file_indexing()` (end-to-end)
- [ ] Test chunk sizes match config (min/max/overlap)
- [ ] Test search retrieval from `.txt` chunks

**Test Cases:**

| Test | Input | Expected Output |
|------|-------|-----------------|
| `test_parse_utf8` | `hello.txt` (UTF-8) | `Document(content="hello world", ...)` |
| `test_parse_latin1` | `legacy.txt` (latin-1) | `Document(...)` with fallback encoding |
| `test_chunk_paragraphs` | Multi-paragraph `.txt` | Chunks split at `\n\n`, sizes within bounds |
| `test_search_txt_chunks` | Query + indexed `.txt` | Returns relevant chunks with scores |

### Phase 4: Documentation (15 min)

**Tasks:**
- [ ] Update `docs/configuration.md` with `.txt` parser example
- [ ] Add `.txt` support to `README.md` features list
- [ ] Document encoding handling in `docs/development.md`

---

## 7. Testing Strategy

### 7.1. Unit Tests

**File:** `tests/unit/test_plaintext_parser.py` (~50 LOC)

```python
import pytest

from src.parsers.plaintext import PlainTextParser


def test_parse_simple_text(tmp_path):
    file = tmp_path / "test.txt"
    file.write_text("Hello world\n\nSecond paragraph", encoding="utf-8")

    parser = PlainTextParser()
    doc = parser.parse(str(file))

    assert doc.content == "Hello world\n\nSecond paragraph"
    assert doc.metadata["source"] == str(file)
    assert doc.links == []
    assert doc.tags == []

def test_parse_encoding_fallback(tmp_path):
    file = tmp_path / "latin.txt"
    file.write_bytes(b"Caf\xe9")  # Latin-1 encoded "Café"

    parser = PlainTextParser()
    doc = parser.parse(str(file))

    assert doc.content == "Café"  # 0xE9 is invalid UTF-8 here, so the latin-1 fallback decodes it

def test_parse_nonexistent_file():
    parser = PlainTextParser()
    with pytest.raises(FileNotFoundError):
        parser.parse("/nonexistent/file.txt")
```

### 7.2. Integration Tests

**File:** `tests/integration/test_txt_indexing.py` (~80 LOC)

```python
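from src.chunking.header_chunker import HeaderBasedChunker
from src.parsers.plaintext import PlainTextParser


def test_index_and_search_txt_file(tmp_path, index_manager, config):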
    # Create test .txt file
    txt_file = tmp_path / "notes.txt"
    txt_file.write_text(
        "Machine Learning Notes\n\n"
        "Neural networks are computational models.\n\n"
        "Training requires labeled data."
    )

    # Index document
    index_manager.index_document(str(txt_file))

    # Query
    results = index_manager.vector.search("neural networks", top_k=5)

    # Verify chunk retrieval
    assert len(results) > 0
    chunk = results[0]
    assert "neural networks" in chunk.content.lower()
    assert chunk.header_path == ""  # No headers in plain text

def test_txt_chunking_respects_size_limits(tmp_path, config):
    # Create large plain text file
    txt_file = tmp_path / "large.txt"
    paragraphs = [f"Paragraph {i} content." * 50 for i in range(100)]
    txt_file.write_text("\n\n".join(paragraphs))

    # Parse and chunk
    parser = PlainTextParser()
    doc = parser.parse(str(txt_file))

    chunker = HeaderBasedChunker(config.chunking)
    chunks = chunker.chunk_document(doc)

    # Verify constraints
    for chunk in chunks:
        assert len(chunk.content) >= config.chunking.min_chunk_chars
        assert len(chunk.content) <= config.chunking.max_chunk_chars
```

### 7.3. Performance Validation

**Benchmark:** Index 100 `.txt` files (1-10KB each), measure:
- Indexing time (should match markdown ±10%)
- Query latency (no degradation)
- Memory usage (no significant increase)
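
A minimal sketch of how such a benchmark corpus and timing loop might be set up. The helper names (`make_corpus`, `time_indexing`) and the `index_fn` callable are hypothetical; in practice `index_fn` would wrap whatever indexing entry point the project exposes, such as the `index_manager.index_document` call used in the integration test above.

```python
import random
import string
import tempfile
import time
from collections.abc import Callable
from pathlib import Path


def make_corpus(directory: Path, count: int = 100, seed: int = 0) -> list[Path]:
    """Write `count` .txt files of roughly 1-10 KB of paragraph-structured filler."""
    rng = random.Random(seed)
    files = []
    for i in range(count):
        target_size = rng.randint(1_000, 10_000)
        paragraphs = []
        size = 0
        while size < target_size:
            words = [
                "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 9)))
                for _ in range(40)
            ]
            paragraph = " ".join(words) + "."
            paragraphs.append(paragraph)
            size += len(paragraph) + 2  # +2 for the blank-line separator
        path = directory / f"doc_{i:03d}.txt"
        path.write_text("\n\n".join(paragraphs), encoding="utf-8")
        files.append(path)
    return files


def time_indexing(files: list[Path], index_fn: Callable[[str], object]) -> float:
    """Return the wall-clock seconds spent calling index_fn on every file."""
    start = time.perf_counter()
    for path in files:
        index_fn(str(path))
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    corpus = make_corpus(Path(tmp), count=100)
    # Stand-in for the real indexer: just read each file back.
    elapsed = time_indexing(corpus, lambda p: Path(p).read_text(encoding="utf-8"))
    print(f"Processed {len(corpus)} files in {elapsed:.3f}s")
```

Running the same loop over an equivalent markdown corpus would give the baseline for the ±10% comparison.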

---

## 8. Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Encoding issues with non-UTF-8 files | Medium | Medium | Fallback encoding sequence (UTF-8 → latin-1 → cp1252) |
| Poor chunking for dense text (logs) | Low | Low | Document as limitation; consider sliding window in future |
| Index rebuild required | High | Low | Automatic on config change (existing mechanism) |
| User confusion (why rebuild?) | Medium | Low | Log clear message: "New parser added, rebuilding index..." |

---

## 9. File Manifest

### Files to Create
- [ ] `src/parsers/plaintext.py` (~30 LOC)
- [ ] `tests/unit/test_plaintext_parser.py` (~50 LOC)
- [ ] `tests/integration/test_txt_indexing.py` (~80 LOC)

### Files to Modify
- [ ] `src/parsers/__init__.py` (+2 LOC: import + export)
- [ ] `src/parsers/dispatcher.py` (+8 LOC: parser name dispatch)
- [ ] `config.toml` (+1 LOC: `.txt` pattern)
- [ ] `docs/configuration.md` (+15 LOC: parser example)
- [ ] `docs/development.md` (+10 LOC: encoding notes)
- [ ] `README.md` (+1 LOC: features list)

**Total LOC:** ~170 lines of code (about 75% tests, 25% production), plus ~25 lines of documentation

---

## 10. Future Enhancements (Out of Scope)

### 10.1. File Type-Specific Strategies
Allow different chunking strategies per file type:
```toml
[chunking.strategies]
"**/*.md" = "header_based"
"**/*.txt" = "paragraph_based"
"**/*.log" = "sliding_window"
```

**Complexity:** High (requires refactoring `get_chunker()` and config schema)

**Value:** Medium (most `.txt` files are prose; logs are rare)
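
One possible shape for the per-file-type lookup described in 10.1, sketched with the standard library only. The `select_strategy` function, the first-match-wins rule, and matching on the file-name component alone are assumptions for illustration; the spec does not define how `get_chunker()` would be refactored.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Mirrors the [chunking.strategies] table above; insertion order sets priority.
STRATEGIES = {
    "**/*.md": "header_based",
    "**/*.txt": "paragraph_based",
    "**/*.log": "sliding_window",
}


def select_strategy(
    file_path: str,
    strategies: dict[str, str],
    default: str = "paragraph_based",
) -> str:
    """Return the strategy of the first pattern whose file-name part matches."""
    name = PurePosixPath(file_path).name
    for pattern, strategy in strategies.items():
        # Assumption: only the final component (e.g. "*.txt") decides the match.
        if fnmatchcase(name, PurePosixPath(pattern).name):
            return strategy
    return default


print(select_strategy("logs/app.log", STRATEGIES))  # sliding_window
```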

### 10.2. Sentence-Based Chunking
Add optional sentence-level chunking for QA workloads:
```toml
[chunking]
strategy = "sentence_based"  # Requires spaCy
```

**Complexity:** High (80 LOC + spaCy dependency)

**Value:** Low (marginal search quality improvement)

### 10.3. Metadata Extraction
Extract dates, URLs, or email addresses from plain text:
```python
metadata = {
    "dates": ["2025-01-08"],
    "urls": ["https://example.com"],
    "emails": ["user@example.com"]
}
```

**Complexity:** Medium (regex patterns + extraction logic)

**Value:** Medium (useful for certain corpora, e.g., logs or emails)
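
A rough sketch of what the regex-based extraction could look like. The patterns are deliberately simple (ISO-style dates, `http(s)` URLs, common email shapes) and the `extract_metadata` name is hypothetical; a real implementation would need to decide which formats matter for the target corpus.

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
URL_RE = re.compile(r"https?://[^\s<>]+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def _unique(values: list[str]) -> list[str]:
    """De-duplicate while keeping first-seen order."""
    return list(dict.fromkeys(values))


def extract_metadata(text: str) -> dict[str, list[str]]:
    """Collect dates, URLs, and email addresses found in plain text."""
    # Trailing punctuation usually belongs to the sentence, not the URL.
    urls = [url.rstrip(".,;:!?)") for url in URL_RE.findall(text)]
    return {
        "dates": _unique(DATE_RE.findall(text)),
        "urls": _unique(urls),
        "emails": _unique(EMAIL_RE.findall(text)),
    }


sample = "Meeting on 2025-01-08. Notes at https://example.com, contact user@example.com."
print(extract_metadata(sample))
```

For the sample sentence this yields the same three-key structure as the metadata dictionary shown above.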

---

## 11. Success Metrics

### 11.1. Functional Metrics
- [ ] `.txt` files indexed without errors
- [ ] Chunk sizes within configured bounds (200-2000 chars)
- [ ] Search queries return relevant `.txt` chunks
- [ ] Encoding fallback handles non-UTF-8 files

### 11.2. Performance Metrics
- [ ] Indexing time for `.txt` matches markdown (±10%)
- [ ] Query latency unchanged (<5ms regression)
- [ ] Memory usage stable (<5% increase)

### 11.3. Quality Metrics
- [ ] Manual review: top-3 results for 10 queries include relevant `.txt` chunks
- [ ] No chunk boundary splits mid-sentence (manual inspection of 20 random chunks)
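
The manual inspection could be narrowed down with a simple heuristic that flags chunks not ending in sentence-final punctuation. This is a sketch under stated assumptions: `ends_mid_sentence` and the `SENTENCE_ENDINGS` tuple are hypothetical, and unpunctuated lines such as titles will be flagged as false positives.

```python
# Assumed set of endings that count as a complete sentence.
SENTENCE_ENDINGS = (".", "!", "?", ":", '."', '!"', '?"', ".)")


def ends_mid_sentence(chunk: str) -> bool:
    """Heuristic: True if the chunk's trailing text is not sentence-final."""
    stripped = chunk.rstrip()
    return bool(stripped) and not stripped.endswith(SENTENCE_ENDINGS)


chunks = ["Neural networks are computational models.", "Training requires labeled"]
suspects = [c for c in chunks if ends_mid_sentence(c)]
print(suspects)  # ['Training requires labeled']
```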

---

## 12. Appendix: Code Examples

### A. Complete PlainTextParser Implementation

```python
# src/parsers/plaintext.py
import logging
from datetime import datetime, timezone
from pathlib import Path

from src.models import Document
from src.parsers.base import DocumentParser

logger = logging.getLogger(__name__)


class PlainTextParser(DocumentParser):
    """
    Parser for plain text (.txt) files.

    Supports UTF-8 and common fallback encodings (latin-1, cp1252, iso-8859-1).
    Returns Document with minimal metadata (no links, tags, or frontmatter).
    """

    def parse(self, file_path: str) -> Document:
        path = Path(file_path)

        if not path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")

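        # Try UTF-8 first, then fall back to legacy single-byte encodings.
        # Note: latin-1 (alias iso-8859-1) decodes any byte sequence, so the
        # entries after it are never reached and the None check is a safeguard.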
        content = None
        encoding_used = None
        for encoding in ["utf-8", "latin-1", "cp1252", "iso-8859-1"]:
            try:
                content = path.read_text(encoding=encoding, errors="strict")
                encoding_used = encoding
                break
            except (UnicodeDecodeError, LookupError):
                continue

        if content is None:
            raise UnicodeDecodeError(
                "utf-8", b"", 0, 1,
                f"Could not decode {file_path} with any supported encoding"
            )

        if encoding_used != "utf-8":
            logger.warning(f"File {file_path} decoded with {encoding_used} encoding")

        modified_time = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)

        return Document(
            content=content,
            metadata={
                "source": str(path),
                "encoding": encoding_used,
            },
            links=[],  # No link extraction for plain text
            tags=[],   # No frontmatter tags
            file_path=str(path),
            modified_time=modified_time,
        )
```

### B. Updated Dispatcher

```python
# src/parsers/dispatcher.py
from pathlib import PurePosixPath

from src.config import Config
from src.parsers.markdown import MarkdownParser
from src.parsers.plaintext import PlainTextParser


def dispatch_parser(file_path: str, config: Config):
    """
    Route file to appropriate parser based on config patterns.

    Supports:
    - MarkdownParser: **/*.md
    - PlainTextParser: **/*.txt
    """
    path = PurePosixPath(file_path)

    for pattern, parser_name in config.parsers.items():

        # Try direct match
        try:
            if path.match(pattern):
                return _instantiate_parser(parser_name, file_path)
        except ValueError:
            pass

        # Try simplified pattern (strip **/)
        if pattern.startswith("**/"):
            simple_pattern = pattern.replace("**/", "")
            try:
                if path.match(simple_pattern):
                    return _instantiate_parser(parser_name, file_path)
            except ValueError:
                pass

    raise ValueError(f"No parser registered for file: {file_path}")


def _instantiate_parser(parser_name: str, file_path: str):
    """Factory function to create parser instances."""
    if parser_name == "MarkdownParser":
        return MarkdownParser()
    elif parser_name == "PlainTextParser":
        return PlainTextParser()
    else:
        raise ValueError(f"Unknown parser: {parser_name} for {file_path}")
```

---

## 13. References

### Internal Documents
- [specs/08-document-chunking.md](./08-document-chunking.md) - Chunking architecture
- [docs/development.md](../docs/development.md) - Adding new parsers
- [src/chunking/header_chunker.py](../src/chunking/header_chunker.py) - Existing chunking logic

### External Resources
- Python `re` module: https://docs.python.org/3/library/re.html
- Unicode encodings: https://docs.python.org/3/howto/unicode.html
- tree-sitter markdown: https://github.com/tree-sitter-grammars/tree-sitter-markdown

---

**Document Version:** 1.0
**Author:** GitHub Copilot (Planner Mode)
**Last Updated:** 2026-01-08
**Status:** Ready for Implementation