Skip to main content
Glama
11-section-id-brainstorm.md•13.4 kB
# Section ID System Analysis and Improvement Brainstorm ## Current Section ID System Analysis ### How it Currently Works The current section ID system in `src/quantalogic_markdown_mcp/mcp_server.py` uses the `SafeMarkdownEditor` class, which generates section IDs through the `_generate_section_id` method: ```python def _generate_section_id(self, heading: Dict[str, Any], line_start: int) -> str: """Generate stable section ID.""" # Use title, level, and position to create stable hash content = f"{heading['content']}:{heading['level']}:{line_start}" hash_value = hashlib.md5(content.encode()).hexdigest()[:8] return f"section_{hash_value}" ``` ### Current System Characteristics **Input Components:** - `heading['content']` - The heading text/title - `heading['level']` - The heading level (1-6) - `line_start` - The 0-indexed line number where the section starts **Output Format:** `section_{8-char-md5-hash}` **Examples from Testing:** ```text section_6c3c9200 - "Test Document" (level 1, line 0) section_74641f7f - "Introduction" (level 2, line 2) section_9f3d1df6 - "Features" (level 2, line 5) section_58663e5d - "Introduction" (level 2, line 8) # Duplicate title, different ID section_fedd0912 - "Installation" (level 2, line 11) section_faede70e - "Introduction" (level 2, line 14) # Another duplicate, different ID ``` ### Strengths of Current Approach 1. **Handles Duplicate Titles**: Multiple sections with the same title get different IDs due to different line positions 2. **Deterministic**: Same content + level + position = same ID 3. **Compact**: 8-character hash keeps IDs short 4. **Stable During Content Updates**: ID doesn't change when section content is updated (only title, level, or position changes trigger new ID) 5. **URL-Safe**: Uses only alphanumeric characters ### Weaknesses of Current Approach 1. **Position Dependency**: Moving sections changes their IDs, breaking external references 2. **Line Number Fragility**: Any edit that shifts line numbers invalidates all subsequent section IDs 3. **Hash Collisions**: Though rare, MD5 truncated to 8 chars could theoretically collide 4. **No Semantic Meaning**: IDs give no hint about section content or structure 5. **Document Restructuring Impact**: Adding/removing sections above a target section breaks its ID 6. **Not Human-Readable**: `section_74641f7f` tells you nothing about the section ## Brainstorming Alternative Approaches ### Option 1: Hierarchical Path-Based IDs **Concept:** Use the section's hierarchical path as the primary identifier. **Format:** `{parent-slug}/{section-slug}` or `{level1}/{level2}/{level3}` **Examples:** ```text "introduction" "getting-started/installation" "getting-started/quick-start" "api-reference/authentication" "api-reference/endpoints/users" ``` **Pros:** - Human-readable and semantic - Reflects document structure - Stable unless section is moved between parents - Easy to understand relationships **Cons:** - Still changes when parent sections are renamed - Requires slug generation (handling special chars, spaces) - Potential conflicts with same-named siblings - Can become very long for deeply nested sections ### Option 2: UUID-Based Persistent IDs **Concept:** Generate a UUID for each section when first created, store persistently. **Format:** `section-{uuid4}` or just `{uuid4}` **Examples:** ```text "section-f47ac10b-58cc-4372-a567-0e02b2c3d479" "a1b2c3d4-e5f6-7890-abcd-ef1234567890" ``` **Implementation:** Could use metadata comments in markdown: ```markdown <!-- section-id: f47ac10b-58cc-4372-a567-0e02b2c3d479 --> ## Introduction This is the introduction section. ``` **Pros:** - Truly stable - never changes regardless of document edits - No collisions (practically impossible) - Works with any section title/structure changes - No dependency on content or position **Cons:** - Not human-readable at all - Requires storing metadata in document or external system - Makes debugging/manual editing harder - Long IDs (36 chars) ### Option 3: Content-Hash Based IDs **Concept:** Generate ID based on section content rather than position. **Format:** `section-{content-hash}` ```python def generate_content_based_id(title, content): # Use title + actual content, not position combined = f"{title}:{content.strip()}" hash_value = hashlib.sha256(combined.encode()).hexdigest()[:12] return f"section-{hash_value}" ``` **Pros:** - Stable as long as content doesn't change - Position independent - Deterministic based on actual content **Cons:** - Changes when section content is edited - Potential issues with whitespace/formatting changes - Still possible collisions - Not human-friendly ### Option 4: Sequence Number + Slug Hybrid **Concept:** Combine a sequential number with a slug derived from the title. **Format:** `{sequence-number}-{title-slug}` **Examples:** ```text "001-introduction" "002-getting-started" "003-installation" "004-introduction-advanced" # Handle duplicates with suffix ``` **Pros:** - Human-readable - Sequential numbers provide natural ordering - Slugs provide semantic meaning - Relatively stable **Cons:** - Sequence numbers change when sections are inserted/moved - Still need duplicate handling - Manual renumbering needed for reorganization ### Option 5: Title-Based with Collision Resolution **Concept:** Use title as primary ID, resolve collisions with suffixes. **Format:** `{title-slug}` or `{title-slug}-{counter}` ```python def generate_title_based_id(title, existing_ids): slug = title.lower().replace(' ', '-').replace(/[^a-z0-9-]/g, '') if slug not in existing_ids: return slug counter = 2 while f"{slug}-{counter}" in existing_ids: counter += 1 return f"{slug}-{counter}" ``` **Examples:** ```text "introduction" "getting-started" "installation" "introduction-2" # First duplicate "introduction-3" # Second duplicate ``` **Pros:** - Very human-readable - Semantic meaning - Clean URLs if used in web context - Predictable naming **Cons:** - IDs can still change if duplicate resolution changes - Requires maintaining collision state - Issues with special characters in titles ### Option 6: Hybrid Approach - Multiple ID Types **Concept:** Support multiple ID types with automatic fallback. ```python class SectionReference: # Multiple ID types for different use cases uuid: str # Permanent, stable ID title_slug: str # Human-readable ID legacy_hash: str # Current MD5-based ID for compatibility path_id: str # Hierarchical path ID ``` **Usage:** ```python # Get section by any ID type section = editor.get_section_by_id("introduction") # title slug section = editor.get_section_by_id("getting-started/introduction") # path section = editor.get_section_by_id("section_74641f7f") # legacy hash section = editor.get_section_by_id("f47ac10b-58cc-4372-a567-0e02b2c3d479") # UUID ``` **Pros:** - Best of all worlds - Backward compatibility - Flexibility for different use cases - Graceful migration path **Cons:** - Complex implementation - Multiple IDs to maintain - Potential confusion about which ID to use ## Recommended Solution: Enhanced Title-Based with Smart Collision Resolution ### Core Design Based on the analysis, I recommend **Option 5 with enhancements**: **Primary ID:** Human-readable slug derived from section title **Collision Resolution:** Intelligent suffix system **Fallback:** Position-based hash for edge cases ### Implementation Strategy ```python class ImprovedSectionIDGenerator: def __init__(self): self.id_registry = {} # Track used IDs def generate_section_id(self, title: str, level: int, line_start: int, existing_sections: List[SectionReference]) -> str: """Generate improved section ID with collision resolution.""" # Step 1: Create base slug from title base_slug = self._create_slug(title) # Step 2: Try base slug first if not self._id_exists(base_slug, existing_sections): return base_slug # Step 3: Smart collision resolution return self._resolve_collision(base_slug, title, level, line_start, existing_sections) def _create_slug(self, title: str) -> str: """Create URL-friendly slug from title.""" # Convert to lowercase slug = title.lower() # Replace spaces and special chars with hyphens slug = re.sub(r'[^\w\s-]', '', slug) # Remove special chars slug = re.sub(r'[-\s]+', '-', slug) # Replace spaces/multiple hyphens slug = slug.strip('-') # Remove leading/trailing hyphens # Limit length if len(slug) > 50: slug = slug[:50].rstrip('-') return slug or "section" # Fallback for empty slugs def _resolve_collision(self, base_slug: str, title: str, level: int, line_start: int, existing_sections: List[SectionReference]) -> str: """Intelligently resolve ID collisions.""" # Strategy 1: Add hierarchical context if available # If this section has parent sections, include them path_slug = self._try_hierarchical_slug(base_slug, level, existing_sections) if path_slug and not self._id_exists(path_slug, existing_sections): return path_slug # Strategy 2: Add semantic suffixes based on context context_slug = self._try_context_suffix(base_slug, title, existing_sections) if context_slug and not self._id_exists(context_slug, existing_sections): return context_slug # Strategy 3: Numeric suffix (traditional approach) counter = 2 while True: candidate = f"{base_slug}-{counter}" if not self._id_exists(candidate, existing_sections): return candidate counter += 1 # Safety valve - fallback to hash after too many attempts if counter > 20: return self._fallback_hash_id(title, level, line_start) def _try_hierarchical_slug(self, base_slug: str, level: int, existing_sections: List[SectionReference]) -> Optional[str]: """Try to create unique ID using hierarchical context.""" # Find parent section for section in reversed(existing_sections): if section.level < level: parent_slug = section.id return f"{parent_slug}-{base_slug}" return None def _try_context_suffix(self, base_slug: str, title: str, existing_sections: List[SectionReference]) -> Optional[str]: """Add context-aware suffixes.""" # Look for meaningful words that could be suffixes words = title.lower().split() # Try important distinguishing words for word in reversed(words[1:]): # Skip first word (likely in base slug) if len(word) > 2 and word not in ['the', 'and', 'for', 'with']: candidate = f"{base_slug}-{word}" if not self._id_exists(candidate, existing_sections): return candidate return None def _fallback_hash_id(self, title: str, level: int, line_start: int) -> str: """Fallback to hash-based ID (current system).""" content = f"{title}:{level}:{line_start}" hash_value = hashlib.md5(content.encode()).hexdigest()[:8] return f"section-{hash_value}" ``` ### Migration Strategy 1. **Phase 1:** Implement new ID generation alongside existing system 2. **Phase 2:** Add dual-ID support to `SectionReference` 3. **Phase 3:** Update tools to accept both ID formats 4. **Phase 4:** Generate migration mapping for existing documents 5. **Phase 5:** Deprecate old IDs with grace period ### Benefits of Recommended Approach 1. **Human Readable:** `introduction`, `getting-started`, `api-authentication` 2. **Semantic:** IDs reflect content meaning 3. **Stable:** Less likely to change than position-based IDs 4. **Collision Handling:** Smart resolution with hierarchical context 5. **Migration Path:** Can coexist with current system 6. **URL Friendly:** Great for web applications 7. **Debugging Friendly:** Easy to understand in logs/debugging ### Example Transformations **Before (Current System):** ```text section_74641f7f - "Introduction" section_58663e5d - "Introduction" section_faede70e - "Introduction" ``` **After (Recommended System):** ```text introduction getting-started-introduction # Parent context added advanced-introduction # Context from title added ``` ### Handling Edge Cases **Empty/Invalid Titles:** Fallback to `section-{hash}` **Special Characters:** Stripped and replaced with hyphens **Very Long Titles:** Truncated to 50 characters **All Duplicates:** Numeric suffix as final fallback **Non-English Titles:** Unicode normalization and transliteration ## Conclusion The recommended solution provides a significant improvement over the current hash-based system by making IDs human-readable and more stable, while maintaining backward compatibility and handling edge cases gracefully. The implementation can be done incrementally without disrupting existing functionality.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/quantalogic/quantalogic_markdown_mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server