11-section-id-brainstorm.mdā¢13.4 kB
# Section ID System Analysis and Improvement Brainstorm
## Current Section ID System Analysis
### How it Currently Works
The current section ID system in `src/quantalogic_markdown_mcp/mcp_server.py` uses the `SafeMarkdownEditor` class, which generates section IDs through the `_generate_section_id` method:
```python
def _generate_section_id(self, heading: Dict[str, Any], line_start: int) -> str:
"""Generate stable section ID."""
# Use title, level, and position to create stable hash
content = f"{heading['content']}:{heading['level']}:{line_start}"
hash_value = hashlib.md5(content.encode()).hexdigest()[:8]
return f"section_{hash_value}"
```
### Current System Characteristics
**Input Components:**
- `heading['content']` - The heading text/title
- `heading['level']` - The heading level (1-6)
- `line_start` - The 0-indexed line number where the section starts
**Output Format:** `section_{8-char-md5-hash}`
**Examples from Testing:**
```text
section_6c3c9200 - "Test Document" (level 1, line 0)
section_74641f7f - "Introduction" (level 2, line 2)
section_9f3d1df6 - "Features" (level 2, line 5)
section_58663e5d - "Introduction" (level 2, line 8) # Duplicate title, different ID
section_fedd0912 - "Installation" (level 2, line 11)
section_faede70e - "Introduction" (level 2, line 14) # Another duplicate, different ID
```
### Strengths of Current Approach
1. **Handles Duplicate Titles**: Multiple sections with the same title get different IDs due to different line positions
2. **Deterministic**: Same content + level + position = same ID
3. **Compact**: 8-character hash keeps IDs short
4. **Stable During Content Updates**: ID doesn't change when section content is updated (only title, level, or position changes trigger new ID)
5. **URL-Safe**: Uses only alphanumeric characters
### Weaknesses of Current Approach
1. **Position Dependency**: Moving sections changes their IDs, breaking external references
2. **Line Number Fragility**: Any edit that shifts line numbers invalidates all subsequent section IDs
3. **Hash Collisions**: Though rare, MD5 truncated to 8 chars could theoretically collide
4. **No Semantic Meaning**: IDs give no hint about section content or structure
5. **Document Restructuring Impact**: Adding/removing sections above a target section breaks its ID
6. **Not Human-Readable**: `section_74641f7f` tells you nothing about the section
## Brainstorming Alternative Approaches
### Option 1: Hierarchical Path-Based IDs
**Concept:** Use the section's hierarchical path as the primary identifier.
**Format:** `{parent-slug}/{section-slug}` or `{level1}/{level2}/{level3}`
**Examples:**
```text
"introduction"
"getting-started/installation"
"getting-started/quick-start"
"api-reference/authentication"
"api-reference/endpoints/users"
```
**Pros:**
- Human-readable and semantic
- Reflects document structure
- Stable unless section is moved between parents
- Easy to understand relationships
**Cons:**
- Still changes when parent sections are renamed
- Requires slug generation (handling special chars, spaces)
- Potential conflicts with same-named siblings
- Can become very long for deeply nested sections
### Option 2: UUID-Based Persistent IDs
**Concept:** Generate a UUID for each section when first created, store persistently.
**Format:** `section-{uuid4}` or just `{uuid4}`
**Examples:**
```text
"section-f47ac10b-58cc-4372-a567-0e02b2c3d479"
"a1b2c3d4-e5f6-7890-abcd-ef1234567890"
```
**Implementation:** Could use metadata comments in markdown:
```markdown
<!-- section-id: f47ac10b-58cc-4372-a567-0e02b2c3d479 -->
## Introduction
This is the introduction section.
```
**Pros:**
- Truly stable - never changes regardless of document edits
- No collisions (practically impossible)
- Works with any section title/structure changes
- No dependency on content or position
**Cons:**
- Not human-readable at all
- Requires storing metadata in document or external system
- Makes debugging/manual editing harder
- Long IDs (36 chars)
### Option 3: Content-Hash Based IDs
**Concept:** Generate ID based on section content rather than position.
**Format:** `section-{content-hash}`
```python
def generate_content_based_id(title, content):
# Use title + actual content, not position
combined = f"{title}:{content.strip()}"
hash_value = hashlib.sha256(combined.encode()).hexdigest()[:12]
return f"section-{hash_value}"
```
**Pros:**
- Stable as long as content doesn't change
- Position independent
- Deterministic based on actual content
**Cons:**
- Changes when section content is edited
- Potential issues with whitespace/formatting changes
- Still possible collisions
- Not human-friendly
### Option 4: Sequence Number + Slug Hybrid
**Concept:** Combine a sequential number with a slug derived from the title.
**Format:** `{sequence-number}-{title-slug}`
**Examples:**
```text
"001-introduction"
"002-getting-started"
"003-installation"
"004-introduction-advanced" # Handle duplicates with suffix
```
**Pros:**
- Human-readable
- Sequential numbers provide natural ordering
- Slugs provide semantic meaning
- Relatively stable
**Cons:**
- Sequence numbers change when sections are inserted/moved
- Still need duplicate handling
- Manual renumbering needed for reorganization
### Option 5: Title-Based with Collision Resolution
**Concept:** Use title as primary ID, resolve collisions with suffixes.
**Format:** `{title-slug}` or `{title-slug}-{counter}`
```python
def generate_title_based_id(title, existing_ids):
slug = title.lower().replace(' ', '-').replace(/[^a-z0-9-]/g, '')
if slug not in existing_ids:
return slug
counter = 2
while f"{slug}-{counter}" in existing_ids:
counter += 1
return f"{slug}-{counter}"
```
**Examples:**
```text
"introduction"
"getting-started"
"installation"
"introduction-2" # First duplicate
"introduction-3" # Second duplicate
```
**Pros:**
- Very human-readable
- Semantic meaning
- Clean URLs if used in web context
- Predictable naming
**Cons:**
- IDs can still change if duplicate resolution changes
- Requires maintaining collision state
- Issues with special characters in titles
### Option 6: Hybrid Approach - Multiple ID Types
**Concept:** Support multiple ID types with automatic fallback.
```python
class SectionReference:
# Multiple ID types for different use cases
uuid: str # Permanent, stable ID
title_slug: str # Human-readable ID
legacy_hash: str # Current MD5-based ID for compatibility
path_id: str # Hierarchical path ID
```
**Usage:**
```python
# Get section by any ID type
section = editor.get_section_by_id("introduction") # title slug
section = editor.get_section_by_id("getting-started/introduction") # path
section = editor.get_section_by_id("section_74641f7f") # legacy hash
section = editor.get_section_by_id("f47ac10b-58cc-4372-a567-0e02b2c3d479") # UUID
```
**Pros:**
- Best of all worlds
- Backward compatibility
- Flexibility for different use cases
- Graceful migration path
**Cons:**
- Complex implementation
- Multiple IDs to maintain
- Potential confusion about which ID to use
## Recommended Solution: Enhanced Title-Based with Smart Collision Resolution
### Core Design
Based on the analysis, I recommend **Option 5 with enhancements**:
**Primary ID:** Human-readable slug derived from section title
**Collision Resolution:** Intelligent suffix system
**Fallback:** Position-based hash for edge cases
### Implementation Strategy
```python
class ImprovedSectionIDGenerator:
def __init__(self):
self.id_registry = {} # Track used IDs
def generate_section_id(self, title: str, level: int, line_start: int,
existing_sections: List[SectionReference]) -> str:
"""Generate improved section ID with collision resolution."""
# Step 1: Create base slug from title
base_slug = self._create_slug(title)
# Step 2: Try base slug first
if not self._id_exists(base_slug, existing_sections):
return base_slug
# Step 3: Smart collision resolution
return self._resolve_collision(base_slug, title, level, line_start, existing_sections)
def _create_slug(self, title: str) -> str:
"""Create URL-friendly slug from title."""
# Convert to lowercase
slug = title.lower()
# Replace spaces and special chars with hyphens
slug = re.sub(r'[^\w\s-]', '', slug) # Remove special chars
slug = re.sub(r'[-\s]+', '-', slug) # Replace spaces/multiple hyphens
slug = slug.strip('-') # Remove leading/trailing hyphens
# Limit length
if len(slug) > 50:
slug = slug[:50].rstrip('-')
return slug or "section" # Fallback for empty slugs
def _resolve_collision(self, base_slug: str, title: str, level: int,
line_start: int, existing_sections: List[SectionReference]) -> str:
"""Intelligently resolve ID collisions."""
# Strategy 1: Add hierarchical context if available
# If this section has parent sections, include them
path_slug = self._try_hierarchical_slug(base_slug, level, existing_sections)
if path_slug and not self._id_exists(path_slug, existing_sections):
return path_slug
# Strategy 2: Add semantic suffixes based on context
context_slug = self._try_context_suffix(base_slug, title, existing_sections)
if context_slug and not self._id_exists(context_slug, existing_sections):
return context_slug
# Strategy 3: Numeric suffix (traditional approach)
counter = 2
while True:
candidate = f"{base_slug}-{counter}"
if not self._id_exists(candidate, existing_sections):
return candidate
counter += 1
# Safety valve - fallback to hash after too many attempts
if counter > 20:
return self._fallback_hash_id(title, level, line_start)
def _try_hierarchical_slug(self, base_slug: str, level: int,
existing_sections: List[SectionReference]) -> Optional[str]:
"""Try to create unique ID using hierarchical context."""
# Find parent section
for section in reversed(existing_sections):
if section.level < level:
parent_slug = section.id
return f"{parent_slug}-{base_slug}"
return None
def _try_context_suffix(self, base_slug: str, title: str,
existing_sections: List[SectionReference]) -> Optional[str]:
"""Add context-aware suffixes."""
# Look for meaningful words that could be suffixes
words = title.lower().split()
# Try important distinguishing words
for word in reversed(words[1:]): # Skip first word (likely in base slug)
if len(word) > 2 and word not in ['the', 'and', 'for', 'with']:
candidate = f"{base_slug}-{word}"
if not self._id_exists(candidate, existing_sections):
return candidate
return None
def _fallback_hash_id(self, title: str, level: int, line_start: int) -> str:
"""Fallback to hash-based ID (current system)."""
content = f"{title}:{level}:{line_start}"
hash_value = hashlib.md5(content.encode()).hexdigest()[:8]
return f"section-{hash_value}"
```
### Migration Strategy
1. **Phase 1:** Implement new ID generation alongside existing system
2. **Phase 2:** Add dual-ID support to `SectionReference`
3. **Phase 3:** Update tools to accept both ID formats
4. **Phase 4:** Generate migration mapping for existing documents
5. **Phase 5:** Deprecate old IDs with grace period
### Benefits of Recommended Approach
1. **Human Readable:** `introduction`, `getting-started`, `api-authentication`
2. **Semantic:** IDs reflect content meaning
3. **Stable:** Less likely to change than position-based IDs
4. **Collision Handling:** Smart resolution with hierarchical context
5. **Migration Path:** Can coexist with current system
6. **URL Friendly:** Great for web applications
7. **Debugging Friendly:** Easy to understand in logs/debugging
### Example Transformations
**Before (Current System):**
```text
section_74641f7f - "Introduction"
section_58663e5d - "Introduction"
section_faede70e - "Introduction"
```
**After (Recommended System):**
```text
introduction
getting-started-introduction # Parent context added
advanced-introduction # Context from title added
```
### Handling Edge Cases
**Empty/Invalid Titles:** Fallback to `section-{hash}`
**Special Characters:** Stripped and replaced with hyphens
**Very Long Titles:** Truncated to 50 characters
**All Duplicates:** Numeric suffix as final fallback
**Non-English Titles:** Unicode normalization and transliteration
## Conclusion
The recommended solution provides a significant improvement over the current hash-based system by making IDs human-readable and more stable, while maintaining backward compatibility and handling edge cases gracefully. The implementation can be done incrementally without disrupting existing functionality.