Pomera AI Commander

Overview Schema Related Servers Score Discussions

analysis-comparison.md•54.6 KiB

# Analysis & Comparison Tools > Diff viewing, text statistics, word frequency analysis, cron expression parsing, and smart diff. --- ## Analysis & Comparison Tools Documentation ### Diff Viewer **Category**: Analysis & Comparison Tools **Availability**: Always Available **Implementation**: `tools/diff_viewer.py` - `DiffViewerWidget` class **Archive Documentation**: `archive/DIFF_VIEWER_IMPROVEMENTS.md` #### Description The Diff Viewer is a sophisticated text comparison tool that provides side-by-side visual comparison of two text documents with advanced features for precise change detection. It uses Python's difflib.SequenceMatcher algorithm to identify additions, deletions, and modifications between texts, with multiple comparison modes, line filtering capabilities, enhanced statistics, and intelligent highlighting for comprehensive text analysis. #### Key Features - **Side-by-Side Comparison**: Visual side-by-side display of text differences with synchronized scrolling - **Multiple Comparison Modes**: Ignore case, match case, and ignore whitespace options - **Advanced Diff Algorithms**: Uses Python's difflib.SequenceMatcher for accurate line-by-line comparison - **Visual Highlighting**: Color-coded highlighting for additions, deletions, and modifications - **Word-Level Differences**: Highlights specific word changes within modified lines - **Line Filtering (NEW)**: Real-time line filtering with clear button for both input and output panes - **Enhanced Statistics (NEW)**: Comprehensive statistics showing bytes, words, sentences, lines, and tokens - **Tabbed Interface**: Compare multiple document pairs simultaneously (7 tabs) - **Synchronized Scrolling**: Coordinated scrolling between comparison panes with mouse wheel support - **Automatic Filter Clearing**: Filters automatically clear when switching tabs or running new comparisons - **Improved Error Handling**: Robust error handling and state management for edge cases #### Visual Interface Layout ``` ┌─────────────────────────────────┬─────────────────────────────────┐ │ Input │ Output │ │ [📁] [⌫] │ [Send to Input ▼] [⎘] [⌫] │ │ Filter: [___________________] ✕ │ Filter: [___________________] ✕ │ ├─────────────────────────────────┼─────────────────────────────────┤ │ ┌─ Tab 1 ─┬─ Tab 2 ─┬─ Tab 3 ─┐│ ┌─ Tab 1 ─┬─ Tab 2 ─┬─ Tab 3 ─┐│ │ │ ││ │ ││ │ │ Line 1: Hello World ││ │ Line 1: Hello World ││ │ │ Line 2: This is a test ││ │ Line 2: This is a test ││ │ │ Line 3: With differences ││ │ Line 3: With changes ││ │ │ Line 4: Some more text ││ │ Line 4: Some more text ││ │ │ ││ │ Line 5: New line added ││ │ └───────────────────────────────┘│ └───────────────────────────────┘│ ├─────────────────────────────────┼─────────────────────────────────┤ │ Bytes: 1.5K | Word: 234 ... │ Bytes: 2.1K | Word: 312 ... │ └─────────────────────────────────┴─────────────────────────────────┘ ``` **Button Functions:** - **📁** (Input): Load file into active input tab - **⌫** (Input): Clear all input tabs - **✕** (Filter): Clear input filter - **Send to Input ▼** (Output): Copy output to specific input tab - **⎘** (Output): Copy output to clipboard - **⌫** (Output): Clear all output tabs - **✕** (Filter): Clear output filter #### Color-Coded Comparison Results ``` ┌─────────────────────────────────┬─────────────────────────────────┐ │ Input (Original) │ Output (Modified) │ ├─────────────────────────────────┼─────────────────────────────────┤ │ Line 1: Hello World │ Line 1: Hello World │ │ Line 2: This is a test │ Line 2: This is a test │ │ Line 3: With differences [RED] │ Line 3: With changes [GREEN] │ │ Line 4: Some more text │ Line 4: Some more text │ │ [RED] │ Line 5: New line added [GREEN] │ └─────────────────────────────────┴─────────────────────────────────┘ ``` **Color Legend:** - **White**: Unchanged lines (equal content) - **Red**: Deleted lines (only in input/original) - **Green**: Added lines (only in output/modified) - **Blue**: Modified lines (different in both) - **Dark Red**: Deleted words within modified lines - **Dark Green**: Added words within modified lines #### Capabilities ##### Core Functionality - **Line-by-Line Comparison**: Compares texts line by line using difflib.SequenceMatcher for precise difference detection - **Change Detection**: Identifies additions, deletions, and modifications with word-level granularity - **Visual Highlighting**: Color-coded display of different types of changes with inline word highlighting - **Preprocessing Options**: Flexible text preprocessing for different comparison needs (case, whitespace) - **Line Filtering (NEW)**: Real-time filtering to show only lines containing specific text - **Statistics Tracking (NEW)**: Comprehensive statistics for both input and output panes - **Multi-Tab Support**: 7 independent comparison tabs for parallel document analysis - **File Loading**: Load files directly into input tabs for comparison ##### Comparison Modes **Ignore Case:** - Performs case-insensitive comparison by converting text to lowercase - "Hello" and "hello" are treated as identical - Useful for comparing texts where case differences are not significant - Default comparison mode **Match Case:** - Performs case-sensitive comparison using original text - "Hello" and "hello" are treated as different - Useful for precise text comparison where case matters - Ideal for code comparison and formal documents **Ignore Whitespace:** - Normalizes whitespace before comparison using regex `\s+` → single space - Multiple spaces, tabs, and line breaks are normalized - Trailing/leading whitespace is stripped - Useful for comparing formatted text where spacing varies - Ideal for code comparison across different formatting styles ##### Visual Indicators **Color Coding:** - **Light Red (#ffebe9)**: Deleted lines (removed from input) - **Light Green (#e6ffed)**: Added lines (new in output) - **Light Blue (#e6f7ff)**: Modified lines (changed content) - **Darker Red (#ffc9c9)**: Deleted words within modified lines - **Darker Green (#a7f0ba)**: Added words within modified lines - **No Highlighting**: Unchanged lines (identical in both texts) **Change Types:** - **Equal**: Lines that are identical in both texts (no highlighting) - **Delete**: Lines present only in the input text (red background, empty line in output) - **Insert**: Lines present only in the output text (empty line in input, green background) - **Replace**: Lines that exist in both texts but with different content (blue background with word-level highlighting) ##### Line Filtering (NEW) **Filter Features:** - **Real-Time Filtering**: Filter updates as you type in the filter field - **Case-Insensitive Search**: Searches are case-insensitive for better usability - **Clear Button**: Quick clear button (✕) to remove filters instantly - **Auto-Clear on Tab Switch**: Filters automatically clear when switching between tabs - **Auto-Clear on Comparison**: Filters automatically clear when running new comparisons - **Original Content Preservation**: Original content is stored and restored when filter is cleared - **Statistics Update**: Statistics update to reflect filtered content - **Independent Filters**: Separate filters for input and output panes **Filter Behavior:** - Shows only lines containing the filter text - Entire line is included if filter text is found anywhere in the line - Empty filter shows all lines (restores original content) - Filter text is highlighted in the results (visual feedback) ##### Enhanced Statistics (NEW) **Statistics Display:** - **Bytes**: UTF-8 encoded byte count with K/M formatting (e.g., "1.5K", "2.3M") - **Words**: Count of whitespace-separated non-empty strings - **Sentences**: Count of sentence-ending punctuation (`.`, `!`, `?`) with minimum of 1 if text exists - **Lines**: Count of newlines + 1 for non-empty text - **Tokens**: Rough estimate for AI processing (characters / 4) **Statistics Location:** - **Input Stats Bar**: Below input notebook, shows statistics for active input tab - **Output Stats Bar**: Below output notebook, shows statistics for active output tab - **Real-Time Updates**: Statistics update automatically when content changes - **Format**: `Bytes: 1.5K | Word: 234 | Sentence: 12 | Line: 45 | Tokens: 567` ##### Input/Output Specifications - **Input**: Two text documents for comparison (via tabs or file loading) - **Output**: Side-by-side visual diff with color-coded highlighting and statistics - **Tab Count**: 7 independent comparison tabs - **Performance**: Efficient comparison for documents up to 10,000+ lines - **Accuracy**: Precise change detection using difflib.SequenceMatcher algorithm - **File Support**: Load text files directly into input tabs #### Configuration ##### Settings Panel Options - **Ignore Case**: Perform case-insensitive comparison (default) - **Match Case**: Perform case-sensitive comparison - **Ignore Whitespace**: Normalize whitespace before comparison ##### Default Settings ```json { "option": "ignore_case" } ``` ##### Interface Layout **Input Pane (Left):** - **Title Row**: "Input" label with buttons - 📁 Load from File button - ⌫ Erase All Tabs button - **Filter Row**: Filter field with clear button (✕) - **Notebook**: 7 tabs for input text - **Statistics Bar**: Shows bytes, words, sentences, lines, tokens **Output Pane (Right):** - **Title Row**: "Output" label with buttons - "Send to Input" dropdown menu (send to specific tabs) - ⎘ Copy to Clipboard button - ⌫ Erase All Tabs button - **Filter Row**: Filter field with clear button (✕) - **Notebook**: 7 tabs for output text - **Statistics Bar**: Shows bytes, words, sentences, lines, tokens **Comparison Controls:** - **Comparison Mode Selection**: Radio buttons or dropdown for mode selection - **Compare Button**: Triggers comparison of active tabs - **Tab Navigation**: Click tabs to switch between different comparisons ##### Keyboard Shortcuts - **Mouse Wheel**: Synchronized scrolling in both panes - **Ctrl+A**: Select all text in active pane - **Ctrl+C**: Copy selected text - **Ctrl+V**: Paste text - **Tab Navigation**: Click tab numbers to switch #### Usage Examples ##### Example 1: Basic Text Comparison **Input Tab (Left):** ``` Hello World This is a test Goodbye ``` **Output Tab (Right):** ``` Hello World This is a demo Farewell ``` **Configuration:** - Mode: Ignore case **Visual Result:** ``` Input Pane: Output Pane: Hello World Hello World This is a test [RED] This is a demo [GREEN] Goodbye [RED] Farewell [GREEN] ``` **Statistics:** - Input: `Bytes: 35 | Word: 6 | Sentence: 1 | Line: 3 | Tokens: 8` - Output: `Bytes: 35 | Word: 6 | Sentence: 1 | Line: 3 | Tokens: 8` ##### Example 2: Case Sensitivity Comparison **Input Text:** ``` Hello World TESTING ``` **Output Text:** ``` hello world testing ``` **Configuration - Ignore Case:** ``` Input Pane: Output Pane: Hello World hello world TESTING testing ``` (No highlighting - treated as identical) **Statistics:** - Input: `Bytes: 19 | Word: 3 | Sentence: 1 | Line: 2 | Tokens: 4` - Output: `Bytes: 19 | Word: 3 | Sentence: 1 | Line: 2 | Tokens: 4` **Configuration - Match Case:** ``` Input Pane: Output Pane: Hello World [RED] hello world [GREEN] TESTING [RED] testing [GREEN] ``` **Statistics:** - Input: `Bytes: 19 | Word: 3 | Sentence: 1 | Line: 2 | Tokens: 4` - Output: `Bytes: 19 | Word: 3 | Sentence: 1 | Line: 2 | Tokens: 4` ##### Example 3: Whitespace Normalization **Input Text:** ``` Hello World This is a test ``` **Output Text:** ``` Hello World This is a test ``` **Configuration:** - Mode: Ignore whitespace **Result:** ``` Input Pane: Output Pane: Hello World Hello World This is a test This is a test ``` (No highlighting - whitespace differences ignored) **Statistics:** - Input: `Bytes: 32 | Word: 6 | Sentence: 1 | Line: 2 | Tokens: 8` - Output: `Bytes: 24 | Word: 6 | Sentence: 1 | Line: 2 | Tokens: 6` ##### Example 4: Line Filtering (NEW) **Input Text (After Comparison):** ``` Line 1: Introduction Line 2: Main content here Line 3: More content Line 4: Conclusion Line 5: Final thoughts ``` **Filter Applied**: "content" **Filtered Result:** ``` Input Pane (Filtered): Line 2: Main content here Line 3: More content ``` **Statistics (Updated for Filtered Content):** - Input: `Bytes: 45 | Word: 8 | Sentence: 1 | Line: 2 | Tokens: 11` **Clear Filter**: Click ✕ button to restore all lines **Original Content Restored:** ``` Input Pane: Line 1: Introduction Line 2: Main content here Line 3: More content Line 4: Conclusion Line 5: Final thoughts ``` **Statistics (Restored):** - Input: `Bytes: 98 | Word: 15 | Sentence: 1 | Line: 5 | Tokens: 24` ##### Example 5: Word-Level Highlighting **Input Text:** ``` The quick brown fox jumps ``` **Output Text:** ``` The fast brown fox leaps ``` **Configuration:** - Mode: Match case **Result:** ``` Input Pane: Output Pane: The quick brown fox jumps [BLUE] The fast brown fox leaps [BLUE] ^^^^^ (darker red) ^^^^ (darker green) ^^^^^ (darker red) ^^^^^ (darker green) ``` **Explanation**: Line is marked as modified (blue background), with specific words "quick"/"fast" and "jumps"/"leaps" highlighted at word level. ##### Example 6: Complex Document Comparison **Input Text:** ``` # Document Title Introduction paragraph here. - First bullet point - Second bullet point Conclusion paragraph. ``` **Output Text:** ``` # Document Title Introduction paragraph updated. - First bullet point - Second bullet point modified - Third bullet point added Conclusion paragraph. ``` **Visual Result:** ``` Input Pane: Output Pane: # Document Title # Document Title Introduction paragraph here. [RED] Introduction paragraph updated. [GREEN] - First bullet point - First bullet point - Second bullet point [RED] - Second bullet point modified [GREEN] - Third bullet point added [GREEN] Conclusion paragraph. Conclusion paragraph. ``` **Statistics:** - Input: `Bytes: 112 | Word: 15 | Sentence: 2 | Line: 5 | Tokens: 28` - Output: `Bytes: 145 | Word: 19 | Sentence: 2 | Line: 6 | Tokens: 36` ##### Example 7: Multi-Tab Comparison **Scenario**: Comparing multiple document versions simultaneously **Tab 1**: Version 1 vs Version 2 **Tab 2**: Version 2 vs Version 3 **Tab 3**: Original vs Final **Benefits:** - Compare multiple versions side-by-side - Switch between comparisons without losing work - Independent filters and statistics for each tab - Efficient workflow for document review ##### Example 8: Filter with Comparison Results **After Comparison** (showing differences): ``` Input Pane: Line 1: No changes here Line 2: Error in processing [RED] Line 3: No changes here Line 4: Warning detected [RED] Line 5: No changes here ``` **Apply Filter**: "Error" **Filtered Result:** ``` Input Pane (Filtered): Line 2: Error in processing [RED] ``` **Use Case**: Quickly focus on specific types of changes (errors, warnings, specific keywords) ##### Example 9: Empty Document Handling **Input Text:** ``` (empty) ``` **Output Text:** ``` Hello World This is new content ``` **Result:** ``` Input Pane: Output Pane: Hello World [GREEN] This is new content [GREEN] ``` **Explanation**: All output lines shown as additions (green) since input is empty. ##### Example 10: Statistics Comparison **Input Text:** ``` This is a short document with some content. It has multiple sentences. And several words. ``` **Output Text:** ``` This is a longer document with additional content and more details. It has multiple sentences with extra information. And many more words than before. ``` **Statistics Comparison:** - Input: `Bytes: 89 | Word: 15 | Sentence: 3 | Line: 2 | Tokens: 22` - Output: `Bytes: 142 | Word: 24 | Sentence: 3 | Line: 2 | Tokens: 35` **Analysis**: Output has 60% more bytes, 60% more words, and 59% more tokens - indicating significant content expansion. #### Common Use Cases 1. **Document Revision**: Compare different versions of documents 2. **Code Review**: Compare code changes and modifications 3. **Content Editing**: Review editorial changes and revisions 4. **Translation Comparison**: Compare original and translated texts 5. **Data Validation**: Verify data transformations and processing 6. **Configuration Management**: Compare configuration files 7. **Quality Assurance**: Verify content changes and updates 8. **Legal Document Review**: Compare contract versions and amendments #### Technical Implementation ##### Class Structure ```python class DiffViewerWidget: """A comprehensive diff viewer widget with side-by-side text comparison.""" def __init__(self, parent, tab_count=7, logger=None, parent_callback=None, dialog_manager=None): self.tab_count = tab_count self.settings = {"option": "ignore_case"} self.input_original_content = {} # For filter restoration self.output_original_content = {} # For filter restoration self._create_ui() ``` ##### Diff Algorithm The Diff Viewer uses Python's `difflib.SequenceMatcher` for accurate line-by-line comparison: ```python def run_comparison(self, option=None): """Compare the active tabs and display the diff.""" # Get active tab content input_text = input_widget.get("1.0", tk.END) output_text = output_widget.get("1.0", tk.END) # Remove trailing newline that tkinter adds if input_text.endswith('\n'): input_text = input_text[:-1] if output_text.endswith('\n'): output_text = output_text[:-1] # Clear filters before comparison self.input_filter_var.set("") self.output_filter_var.set("") # Preprocess texts according to comparison mode left_lines = self._preprocess_for_diff(input_text, option) right_lines = self._preprocess_for_diff(output_text, option) # Extract comparison strings left_cmp = [l["cmp"] for l in left_lines] right_cmp = [r["cmp"] for r in right_lines] # Perform diff comparison import difflib matcher = difflib.SequenceMatcher(None, left_cmp, right_cmp, autojunk=False) # Process diff operations for tag, i1, i2, j1, j2 in matcher.get_opcodes(): if tag == 'equal': # Handle identical lines for i in range(i1, i2): input_widget.insert(tk.END, left_lines[i]["raw"] + '\n') output_widget.insert(tk.END, right_lines[j1 + (i - i1)]["raw"] + '\n') elif tag == 'delete': # Handle deleted lines (red background, empty line in output) for i in range(i1, i2): input_widget.insert(tk.END, left_lines[i]["raw"] + '\n', 'deletion') output_widget.insert(tk.END, '\n') elif tag == 'insert': # Handle inserted lines (empty line in input, green background) for j in range(j1, j2): input_widget.insert(tk.END, '\n') output_widget.insert(tk.END, right_lines[j]["raw"] + '\n', 'addition') elif tag == 'replace': # Handle modified lines with word-level highlighting self._highlight_word_diffs(input_widget, left_lines[i1:i2], output_widget, right_lines[j1:j2]) ``` ##### Preprocessing Logic ```python def _preprocess_for_diff(self, text, option): """Preprocess text into line dicts according to diff option.""" lines = text.splitlines() processed = [] for line in lines: cmp_line = line if option == "ignore_case": cmp_line = cmp_line.lower() elif option == "ignore_whitespace": cmp_line = re.sub(r"\s+", " ", cmp_line).strip() processed.append({"raw": line, "cmp": cmp_line}) return processed ``` ##### Line Filtering Implementation (NEW) ```python def _on_input_filter_changed(self, *args): """Handle input filter changes.""" filter_text = self.input_filter_var.get() active_idx = self.input_notebook.index("current") current_tab = self.input_tabs[active_idx] if not filter_text: # Restore original content if filter is cleared if active_idx in self.input_original_content: current_tab.text.delete("1.0", tk.END) current_tab.text.insert("1.0", self.input_original_content[active_idx]) del self.input_original_content[active_idx] else: # Store original content if not already stored if active_idx not in self.input_original_content: self.input_original_content[active_idx] = current_tab.text.get("1.0", tk.END) # Apply filter original_content = self.input_original_content[active_idx] lines = original_content.split('\n') filtered_lines = [line for line in lines if filter_text.lower() in line.lower()] filtered_content = '\n'.join(filtered_lines) # Update display current_tab.text.delete("1.0", tk.END) current_tab.text.insert("1.0", filtered_content) # Update statistics self.update_statistics() def _clear_input_filter(self): """Clear the input filter.""" self.input_filter_var.set("") ``` ##### Statistics Calculation (NEW) ```python def update_statistics(self): """Update statistics bars for active tabs.""" try: active_input_tab = self.input_tabs[self.input_notebook.index("current")] active_output_tab = self.output_tabs[self.output_notebook.index("current")] # Calculate input statistics input_content = active_input_tab.text.get("1.0", tk.END).rstrip('\n') input_stats = self._calculate_statistics(input_content) self.input_stats_bar.config(text=input_stats) # Calculate output statistics output_content = active_output_tab.text.get("1.0", tk.END).rstrip('\n') output_stats = self._calculate_statistics(output_content) self.output_stats_bar.config(text=output_stats) except: pass def _calculate_statistics(self, text): """Calculate comprehensive statistics for text.""" if not text: return "Bytes: 0 | Word: 0 | Sentence: 0 | Line: 0 | Tokens: 0" # Bytes (with K/M formatting) byte_count = len(text.encode('utf-8')) if byte_count >= 1_000_000: bytes_str = f"{byte_count / 1_000_000:.1f}M" elif byte_count >= 1_000: bytes_str = f"{byte_count / 1_000:.1f}K" else: bytes_str = str(byte_count) # Words words = [w for w in text.split() if w] word_count = len(words) # Sentences sentence_count = max(1, text.count('.') + text.count('!') + text.count('?')) # Lines line_count = text.count('\n') + 1 # Tokens (rough estimate for AI processing) token_count = len(text) // 4 return f"Bytes: {bytes_str} | Word: {word_count} | Sentence: {sentence_count} | Line: {line_count} | Tokens: {token_count}" ``` ##### Word-Level Highlighting ```python def _highlight_word_diffs(self, w1, lines1, w2, lines2): """Highlight word-level differences within a 'replace' block.""" for line1, line2 in zip(lines1, lines2): # Insert lines with modification background w1.insert(tk.END, line1 + '\n', 'modification') w2.insert(tk.END, line2 + '\n', 'modification') # Get line positions line_start1 = w1.index(f"{w1.index(tk.INSERT)} -1 lines linestart") line_start2 = w2.index(f"{w2.index(tk.INSERT)} -1 lines linestart") # Split into words words1 = re.split(r'(\s+)', line1) words2 = re.split(r'(\s+)', line2) # Find word-level differences import difflib matcher = difflib.SequenceMatcher(None, words1, words2) for tag, i1, i2, j1, j2 in matcher.get_opcodes(): if tag == 'delete' or tag == 'replace': start_char1 = len("".join(words1[:i1])) end_char1 = len("".join(words1[:i2])) w1.tag_add('inline_del', f"{line_start1}+{start_char1}c", f"{line_start1}+{end_char1}c") if tag == 'insert' or tag == 'replace': start_char2 = len("".join(words2[:j1])) end_char2 = len("".join(words2[:j2])) w2.tag_add('inline_add', f"{line_start2}+{start_char2}c", f"{line_start2}+{end_char2}c") ``` ##### Visual Highlighting - **Tag-Based Highlighting**: Uses Tkinter text tags for color coding - **Color Tags**: `addition`, `deletion`, `modification`, `inline_add`, `inline_del` - **Word-Level Differences**: Highlights specific words within changed lines using regex word splitting - **Synchronized Display**: Maintains alignment between left and right panes ##### Synchronized Scrolling ```python def _sync_scroll(self, *args): """Sync both text widgets when one's scrollbar is used.""" active_input_tab.text.yview(*args) active_output_tab.text.yview(*args) def _on_mousewheel(self, event): """Handle mouse wheel scrolling over either text widget.""" if platform.system() == "Windows": delta = int(-1*(event.delta/120)) elif platform.system() == "Darwin": delta = int(-1 * event.delta) else: delta = -1 if event.num == 4 else 1 active_input_tab.text.yview_scroll(delta, "units") active_output_tab.text.yview_scroll(delta, "units") return "break" ``` ##### Tab Change Handling ```python def _on_tab_changed(self, event=None): """Handle tab change events.""" # Clear filters when switching tabs self.input_filter_var.set("") self.output_filter_var.set("") # Update synchronized scrolling self._setup_sync(event) # Update statistics self.update_statistics() ``` ##### Dependencies - **Required**: Python standard library (tkinter, difflib, re, platform modules) - **Optional**: - `core.efficient_line_numbers.OptimizedTextWithLineNumbers` (enhanced line numbers) - `core.memory_efficient_text_widget.MemoryEfficientTextWidget` (memory optimization) - **Fallback**: Basic `TextWithLineNumbers` implementation if optimized components unavailable ##### Performance Considerations - **Memory Efficient**: Processes texts line by line, stores only filtered content - **Scalable**: Handles documents up to 10,000+ lines efficiently - **Responsive**: Real-time comparison and highlighting with minimal lag - **Filter Performance**: O(n) complexity for line filtering using list comprehensions - **Statistics Calculation**: < 10ms for typical documents - **Comparison Speed**: ~100ms for 1,000 lines, ~1s for 10,000 lines #### Interface Features ##### Tabbed Comparison - **Multiple Pairs**: Compare multiple document pairs simultaneously - **Tab Navigation**: Easy switching between different comparisons - **Independent Settings**: Each comparison can use different modes ##### Synchronized Scrolling - **Coordinated Navigation**: Scrolling one pane automatically scrolls the other - **Alignment Maintenance**: Keeps corresponding lines aligned - **Visual Consistency**: Maintains visual relationship between changes ##### Content Management - **Load from Main Tabs**: Import content from main application tabs - **Send to Input**: Export comparison results back to main tabs - **Real-Time Updates**: Immediate visual feedback on changes #### Best Practices ##### Recommended Usage - **Preprocessing**: Choose appropriate comparison mode for your use case (ignore case for general text, match case for code) - **Document Preparation**: Ensure texts are properly formatted before comparison - **Visual Review**: Use color coding to quickly identify change types (red=deleted, green=added, blue=modified) - **Systematic Review**: Review changes systematically from top to bottom - **Line Filtering**: Use filters to focus on specific types of changes (errors, warnings, keywords) - **Statistics Monitoring**: Check statistics to understand the scope of changes - **Multi-Tab Workflow**: Use multiple tabs to compare different versions simultaneously - **Filter Clearing**: Remember filters auto-clear when switching tabs or running new comparisons ##### Performance Tips - **Document Size**: Tool handles documents up to 10,000+ lines efficiently - **Comparison Mode**: Choose appropriate mode to reduce false positives - Use "Ignore Case" for general text comparison - Use "Match Case" for code or formal documents - Use "Ignore Whitespace" for formatted text with varying spacing - **Memory Usage**: Large documents (>10MB) may require more processing time - **Visual Clarity**: Use appropriate zoom levels for comfortable reading - **Filter Performance**: Filters update in real-time with minimal performance impact - **Statistics Calculation**: Statistics update automatically with < 10ms overhead ##### Common Pitfalls - **Whitespace Sensitivity**: Be aware of whitespace handling in different modes - **Case Sensitivity**: Choose appropriate case handling for your comparison needs - **Line Endings**: Different line ending formats (CRLF vs LF) may affect comparison - **Character Encoding**: Ensure both texts use consistent character encoding (UTF-8 recommended) - **Trailing Newlines**: Tkinter automatically adds trailing newlines - these are handled automatically - **Filter State**: Filters are automatically cleared when switching tabs or running comparisons - **Empty Documents**: One or both documents can be empty - tool handles this gracefully - **Word-Level Highlighting**: Only visible in "replace" operations, not in pure additions/deletions ##### Line Filtering Best Practices (NEW) - **Use Case-Insensitive Filters**: Filters are case-insensitive for better usability - **Filter for Specific Changes**: Use filters to focus on errors, warnings, or specific keywords - **Clear Filters Regularly**: Click ✕ button to restore full content - **Check Statistics**: Statistics update to reflect filtered content - **Original Content Preserved**: Original content is stored and restored when filter is cleared - **Independent Filters**: Input and output filters work independently - **Auto-Clear Behavior**: Filters automatically clear when switching tabs or running new comparisons ##### Statistics Interpretation (NEW) - **Bytes**: Total size in UTF-8 encoding (K/M formatting for large files) - **Words**: Whitespace-separated non-empty strings - **Sentences**: Count of sentence-ending punctuation (., !, ?) - **Lines**: Number of lines (newline count + 1) - **Tokens**: Rough estimate for AI processing (characters / 4) - **Comparison**: Compare statistics between input and output to understand change scope - **Filtered Statistics**: Statistics update to reflect filtered content #### Comparison Algorithms ##### SequenceMatcher Algorithm - **Longest Common Subsequence**: Finds optimal alignment between texts - **Change Minimization**: Minimizes the number of operations needed - **Accuracy**: Provides precise change detection - **Performance**: Efficient for typical document sizes ##### Operation Types 1. **Equal**: Sequences that are identical 2. **Delete**: Sequences present only in the first text 3. **Insert**: Sequences present only in the second text 4. **Replace**: Sequences that differ between texts #### New Features Summary ##### Line Filtering (NEW) **Overview**: Real-time line filtering capability for both input and output panes. **Features:** - Real-time filtering as you type - Case-insensitive search for better usability - Clear button (✕) for quick filter removal - Automatic filter clearing when switching tabs or running comparisons - Original content preservation and restoration - Statistics update to reflect filtered content - Independent filters for input and output panes **Use Cases:** - Focus on specific types of changes (errors, warnings, keywords) - Quickly find lines containing specific text - Filter out noise to focus on relevant changes - Analyze specific patterns in comparison results **Implementation Details:** - Filter text stored in `input_filter_var` and `output_filter_var` - Original content stored in `input_original_content` and `output_original_content` dictionaries - Filter logic: `filter_text.lower() in line.lower()` (case-insensitive) - Automatic cleanup on tab switch and new comparison ##### Enhanced Statistics (NEW) **Overview**: Comprehensive statistics display for both input and output panes. **Metrics:** - **Bytes**: UTF-8 encoded byte count with K/M formatting - **Words**: Count of whitespace-separated non-empty strings - **Sentences**: Count of sentence-ending punctuation (., !, ?) - **Lines**: Number of lines (newline count + 1) - **Tokens**: Rough estimate for AI processing (characters / 4) **Features:** - Real-time updates when content changes - Statistics bars below each notebook - Format: `Bytes: 1.5K | Word: 234 | Sentence: 12 | Line: 45 | Tokens: 567` - Updates reflect filtered content when filters are applied - Helps understand scope of changes between documents **Use Cases:** - Understand document size and complexity - Compare content volume between input and output - Estimate AI processing requirements (token count) - Track changes in document statistics after filtering ##### Improved Error Handling (NEW) **Overview**: Robust error handling and state management for edge cases. **Improvements:** - Consistent trailing newline handling (Tkinter adds trailing newlines automatically) - Proper empty document handling (one or both documents empty) - Graceful degradation if operations fail - Detailed error logging for debugging - State cleanup when switching contexts - Filter state management across tab changes **Benefits:** - More reliable comparison results - Better user experience with edge cases - Easier debugging and troubleshooting - Consistent behavior across all operations ##### Automatic Filter Clearing (NEW) **Overview**: Filters automatically clear in specific scenarios to avoid confusion. **Scenarios:** - **Tab Switch**: Filters clear when switching between tabs - **New Comparison**: Filters clear when running a new comparison - **Manual Clear**: User clicks ✕ button to clear filter **Benefits:** - Prevents confusion from persistent filters - Ensures fresh start for each comparison - Maintains clean state across operations - Reduces user errors from forgotten filters #### Error Handling ##### Empty Documents - **One Empty**: Shows all content as additions (green) or deletions (red) - **Both Empty**: No comparison performed, no highlighting - **Graceful Handling**: Appropriate visual feedback for edge cases - **Statistics**: Shows "0" for all metrics when document is empty ##### Large Documents - **Memory Management**: Efficient processing of large texts using line-by-line processing - **Performance Monitoring**: Tracks processing time for large comparisons - **User Feedback**: Statistics provide immediate feedback on document size - **Scalability**: Handles documents up to 10,000+ lines efficiently - **Optimization**: Uses optimized text widgets when available ##### Malformed Input - **Encoding Issues**: Handles different character encodings gracefully (UTF-8 recommended) - **Line Ending Variations**: Normalizes different line ending formats (CRLF, LF, CR) - **Special Characters**: Properly handles Unicode and special characters - **Trailing Newlines**: Automatically removes trailing newlines added by Tkinter - **Empty Lines**: Handles documents with many empty lines correctly ##### Filter Edge Cases - **No Matches**: Shows empty content if filter matches no lines - **Empty Filter**: Restores original content when filter is cleared - **Filter During Comparison**: Filters are cleared before running new comparison - **Tab Switch with Filter**: Filters are cleared when switching tabs - **Original Content Loss**: Original content is preserved in memory for restoration ##### State Management - **Tab Switching**: Proper cleanup of filters and state when switching tabs - **Comparison Reset**: Clear stored original content when running new comparison - **Filter Restoration**: Original content restored when filter is cleared - **Statistics Update**: Statistics always reflect current view (filtered or unfiltered) - **Synchronized Scrolling**: Maintains synchronization across all operations #### Integration with Other Tools ##### Workflow Examples **Example 1: Basic Comparison** 1. Paste text into Input tab 2. Paste text into Output tab 3. Click "Compare Active Tabs" 4. View differences with color highlighting **Example 2: File Comparison** 1. Click 📁 to load file into Input 2. Click 📁 to load another file into Output 3. Select comparison mode 4. Click "Compare Active Tabs" **Example 3: Using Filters** 1. After comparison, type in Filter field 2. Only matching lines are shown 3. Statistics update automatically 4. Click ✕ to restore all lines **Example 4: Multiple Comparisons** 1. Use Tab 1 for first comparison 2. Switch to Tab 2 for second comparison 3. Filters auto-clear when switching 4. Each tab maintains its own content **Example 5: Cross-Tool Workflows** 1. **Process → Compare → Review**: - Text processing → Diff Viewer → Manual review 2. **Compare → Extract → Analyze**: - Diff Viewer → Find & Replace → Word Frequency Counter 3. **Version Control → Compare → Validate**: - Document versions → Diff Viewer → Quality assurance #### Keyboard Shortcuts - **Ctrl+C**: Copy selected text - **Ctrl+V**: Paste text - **Ctrl+A**: Select all text - **Ctrl+Z**: Undo (in text areas) - **Ctrl+Y**: Redo (in text areas) - **Mouse Wheel**: Synchronized scrolling (both sides scroll together) #### Tips & Best Practices - Use "Ignore case" for most text comparisons - Use "Ignore whitespace" for code with different indentation - Use filters to focus on specific changes - Clear filters before running new comparisons - Use multiple tabs for comparing different file pairs - Statistics help identify the scope of changes - Word-level highlighting shows exact differences in modified lines - Synchronized scrolling keeps both sides aligned #### Statistics Explained **Statistics Format**: `Bytes: 1.5K | Word: 234 | Sentence: 12 | Line: 45 | Tokens: 350` - **Bytes**: File size (with K/M suffixes for readability) - **Word**: Number of whitespace-separated words - **Sentence**: Count of sentence-ending punctuation (., !, ?) - **Line**: Number of lines in the text - **Tokens**: Estimated token count (useful for AI/LLM context limits) Statistics update in real-time as you: - Type or edit text - Apply filters - Switch tabs - Run comparisons #### Troubleshooting ##### Issue: Filters not clearing **Solution**: Filters automatically clear when switching tabs or running new comparisons. If filter persists, click the ✕ button to manually clear it. ##### Issue: Statistics not updating **Solution**: Statistics update automatically when content changes. If statistics seem incorrect, try switching tabs and back, or run a new comparison. ##### Issue: Comparison not showing differences **Solution**: Check your comparison mode - "Ignore Case" may treat differences as identical. Try "Match Case" mode for more precise comparison. ##### Issue: Word-level highlighting not visible **Solution**: Word-level highlighting only appears in "replace" operations (modified lines). Pure additions/deletions don't show word-level highlighting. ##### Issue: Synchronized scrolling not working **Solution**: Synchronized scrolling is set up when tabs are activated. Try switching to another tab and back to reset synchronization. ##### Issue: Large documents slow to compare **Solution**: Documents over 10,000 lines may take longer to process. Consider breaking large documents into smaller sections for comparison. ##### Issue: Filter shows no results **Solution**: If filter matches no lines, the pane will be empty. Clear the filter with ✕ button to restore all content. ##### Issue: Original content lost after filtering **Solution**: Original content is preserved in memory. Clear the filter with ✕ button to restore it. If content is truly lost, it may have been overwritten by a new comparison. #### Related Tools - **Find & Replace Text**: Process texts before comparison to normalize content - **Word Frequency Counter**: Analyze word usage patterns in compared documents - **Case Tool**: Normalize case before comparison for consistent results - **List Comparator**: Compare lists line-by-line with different output format - **Email Header Analyzer**: Compare email headers for routing analysis - **Alphabetical Sorter**: Sort lines before comparison for better alignment #### Integration Workflows ##### Document Review Workflow 1. **Load Documents**: Load two document versions into input and output tabs 2. **Run Comparison**: Select comparison mode and run comparison 3. **Review Changes**: Use color coding to identify additions, deletions, modifications 4. **Filter Specific Changes**: Apply filters to focus on specific types of changes 5. **Analyze Statistics**: Compare statistics to understand scope of changes 6. **Export Results**: Copy results to clipboard or send to input tabs ##### Code Review Workflow 1. **Load Code Files**: Load original and modified code into tabs 2. **Match Case Comparison**: Use "Match Case" mode for precise code comparison 3. **Review Line Changes**: Identify added, deleted, and modified lines 4. **Word-Level Analysis**: Check word-level highlighting for specific changes 5. **Filter by Keyword**: Filter for specific functions, variables, or comments 6. **Multi-File Comparison**: Use multiple tabs to compare different files ##### Content Editing Workflow 1. **Load Original and Edited**: Load original and edited content 2. **Ignore Case Comparison**: Use "Ignore Case" for general content comparison 3. **Review Editorial Changes**: Identify content additions, deletions, modifications 4. **Filter by Topic**: Use filters to focus on specific topics or sections 5. **Statistics Comparison**: Compare word counts and sentence counts 6. **Validate Changes**: Ensure all intended changes are present #### Performance Benchmarks ##### Comparison Speed - **100 lines**: < 10ms - **1,000 lines**: ~100ms - **10,000 lines**: ~1s - **50,000 lines**: ~5s (may vary based on system) ##### Filter Performance - **Filter Application**: < 50ms for typical documents - **Filter Clearing**: < 10ms (instant restoration) - **Statistics Update**: < 10ms ##### Memory Usage - **Base Widget**: ~5MB - **Per Tab**: ~2MB additional - **Large Document (10MB)**: +50MB during comparison - **Filter Storage**: Minimal (stores original content per tab) #### See Also - [Word Frequency Counter Documentation](#word-frequency-counter) - [List Comparator Documentation](#list-comparator-widget) - [Analysis & Comparison Tools Overview](#analysis--comparison-tools-2-tools) - [Text Comparison Best Practices](#best-practices) - [Archive Documentation: DIFF_VIEWER_IMPROVEMENTS.md](archive/DIFF_VIEWER_IMPROVEMENTS.md) --- ## Word Frequency Counter **Note**: Word Frequency Counter is now integrated into the **Text Statistics** tool as a dedicated button. The functionality remains the same, but it's accessed through the Text Statistics interface. See [Text Statistics Documentation](#text-statistics) for details. **Category**: Analysis & Comparison Tools **Availability**: Always Available (via Text Statistics) **TextProcessor Method**: `word_frequency()` #### Description The Word Frequency Counter is a statistical text analysis tool that analyzes word usage patterns in text documents. It counts the occurrence of each word, calculates frequency percentages, and presents results in descending order of frequency, making it invaluable for content analysis, writing improvement, and linguistic research. The tool is now accessible as a button within the Text Statistics tool interface. #### Key Features - **Comprehensive Word Counting**: Counts all words in the input text - **Statistical Analysis**: Calculates both absolute counts and percentage frequencies - **Frequency Ranking**: Sorts results by frequency from most to least common - **Case Normalization**: Converts all text to lowercase for consistent counting - **Word Boundary Detection**: Uses regex word boundaries for accurate word identification - **Percentage Calculation**: Shows relative frequency as percentage of total words #### Capabilities ##### Core Functionality - **Word Extraction**: Identifies individual words using regex pattern `\b\w+\b` - **Frequency Counting**: Counts occurrences of each unique word - **Statistical Calculation**: Computes percentage frequency for each word - **Ranking**: Orders results from most frequent to least frequent words ##### Word Recognition - **Word Boundaries**: Uses `\b\w+\b` regex pattern for precise word detection - **Alphanumeric Characters**: Recognizes letters, numbers, and underscores as word characters - **Case Insensitive**: Converts all text to lowercase before analysis - **Punctuation Handling**: Automatically excludes punctuation marks from word counts ##### Statistical Metrics - **Absolute Frequency**: Raw count of word occurrences - **Relative Frequency**: Percentage of total words - **Total Word Count**: Overall number of words in the text - **Unique Word Count**: Number of distinct words found ##### Input/Output Specifications - **Input**: Any text content (documents, articles, books, etc.) - **Output**: Ranked list of words with counts and percentages - **Performance**: Efficient processing for typical document sizes - **Accuracy**: Precise word counting with statistical calculations #### Configuration The Word Frequency Counter operates without configuration options - it automatically analyzes all words in the input text and provides comprehensive frequency statistics. #### Usage Examples ##### Basic Word Frequency Example **Input:** ``` The quick brown fox jumps over the lazy dog. The dog was very lazy. ``` **Output:** ``` the (3 / 23.08%) lazy (2 / 15.38%) dog (2 / 15.38%) brown (1 / 7.69%) fox (1 / 7.69%) jumps (1 / 7.69%) over (1 / 7.69%) quick (1 / 7.69%) very (1 / 7.69%) was (1 / 7.69%) ``` ##### Article Analysis Example **Input:** ``` Artificial intelligence is transforming technology. Machine learning and artificial intelligence are revolutionizing how we process data. The future of technology depends on artificial intelligence. ``` **Output:** ``` artificial (3 / 15.79%) intelligence (3 / 15.79%) technology (2 / 10.53%) and (1 / 5.26%) are (1 / 5.26%) data (1 / 5.26%) depends (1 / 5.26%) future (1 / 5.26%) how (1 / 5.26%) is (1 / 5.26%) learning (1 / 5.26%) machine (1 / 5.26%) of (1 / 5.26%) on (1 / 5.26%) process (1 / 5.26%) revolutionizing (1 / 5.26%) the (1 / 5.26%) transforming (1 / 5.26%) we (1 / 5.26%) ``` ##### Short Text Analysis Example **Input:** ``` Hello world hello ``` **Output:** ``` hello (2 / 66.67%) world (1 / 33.33%) ``` ##### Mixed Case Text Example **Input:** ``` The THE the ThE ``` **Output:** ``` the (4 / 100.00%) ``` ##### Numbers and Words Example **Input:** ``` There are 5 cats and 3 dogs in the house. The 5 cats are sleeping. ``` **Output:** ``` the (2 / 16.67%) 5 (2 / 16.67%) are (2 / 16.67%) cats (2 / 16.67%) and (1 / 8.33%) dogs (1 / 8.33%) house (1 / 8.33%) in (1 / 8.33%) sleeping (1 / 8.33%) there (1 / 8.33%) 3 (1 / 8.33%) ``` #### Common Use Cases 1. **Content Analysis**: Analyze word usage patterns in articles, blogs, or documents 2. **Writing Improvement**: Identify overused words and improve writing variety 3. **SEO Optimization**: Analyze keyword density and frequency in web content 4. **Academic Research**: Study linguistic patterns in literature or research papers 5. **Market Research**: Analyze word frequency in customer feedback or surveys 6. **Social Media Analysis**: Study word usage patterns in social media posts 7. **Translation Quality**: Compare word frequency between original and translated texts 8. **Readability Assessment**: Analyze vocabulary complexity and word repetition #### Technical Implementation ##### TextProcessor Method ```python @staticmethod def word_frequency(text): """Counts the frequency of each word in the text.""" words = re.findall(r'\b\w+\b', text.lower()) if not words: return "No words found." from collections import Counter word_counts = Counter(words) total_words = len(words) report = [] for word, count in word_counts.most_common(): percentage = (count / total_words) * 100 report.append(f"{word} ({count} / {percentage:.2f}%)") return '\n'.join(report) ``` ##### Algorithm Details **Word Extraction:** 1. Convert input text to lowercase using `text.lower()` 2. Extract words using regex pattern `r'\b\w+\b'` 3. Pattern matches word boundaries and alphanumeric characters **Frequency Counting:** 1. Use `collections.Counter` for efficient counting 2. Count occurrences of each unique word 3. Calculate total word count **Statistical Analysis:** 1. Calculate percentage: `(count / total_words) * 100` 2. Round percentages to 2 decimal places 3. Sort results by frequency using `most_common()` **Output Formatting:** 1. Format each entry as: `word (count / percentage%)` 2. Join all entries with newlines 3. Present in descending frequency order ##### Word Recognition Pattern - **`\b`**: Word boundary (start/end of word) - **`\w+`**: One or more word characters (letters, digits, underscore) - **`\b`**: Word boundary (end of word) This pattern ensures accurate word detection while excluding punctuation and whitespace. ##### Dependencies - **Required**: Python standard library (re, collections modules) - **Optional**: None ##### Performance Considerations - **Memory Efficient**: Uses Counter for optimized counting - **Fast Processing**: Regex-based word extraction is very efficient - **Scalable**: Handles documents of typical sizes without performance issues #### Statistical Analysis Features ##### Frequency Metrics - **Absolute Frequency**: Raw count of each word's occurrences - **Relative Frequency**: Percentage of total words each word represents - **Ranking**: Words ordered from most to least frequent - **Total Count**: Overall number of words analyzed ##### Data Insights - **Most Common Words**: Identifies frequently used terms - **Vocabulary Diversity**: Shows range of unique words used - **Usage Patterns**: Reveals writing style and emphasis - **Keyword Density**: Useful for SEO and content optimization #### Best Practices ##### Recommended Usage - **Content Review**: Use to identify overused words in writing - **Keyword Analysis**: Analyze keyword density for SEO purposes - **Comparative Analysis**: Compare word frequency across different texts - **Writing Improvement**: Identify repetitive language patterns ##### Performance Tips - **Large Documents**: Tool handles typical document sizes efficiently - **Memory Usage**: Counter object is memory-efficient for word counting - **Processing Speed**: Regex-based extraction is fast and reliable - **Result Interpretation**: Focus on high-frequency words for insights ##### Common Pitfalls - **Punctuation Exclusion**: Punctuation marks are not counted as words - **Case Insensitivity**: All words are converted to lowercase - **Number Inclusion**: Numbers are treated as words if they contain word characters - **Hyphenated Words**: Hyphenated words may be split depending on context #### Analysis Applications ##### Content Writing - **Word Variety**: Identify overused words to improve writing diversity - **Style Analysis**: Understand writing patterns and tendencies - **Readability**: Assess vocabulary complexity and repetition - **Editing**: Find words that appear too frequently ##### SEO and Marketing - **Keyword Density**: Analyze keyword frequency for search optimization - **Content Optimization**: Ensure balanced keyword usage - **Competitor Analysis**: Compare word usage with competitor content - **Brand Messaging**: Analyze consistency in brand language ##### Academic Research - **Linguistic Analysis**: Study word usage patterns in literature - **Comparative Studies**: Compare vocabulary across different texts - **Content Analysis**: Quantitative analysis of textual content - **Research Validation**: Verify consistency in academic writing #### Error Handling ##### No Words Found **Input:** ``` !@#$%^&*() ``` **Output:** ``` No words found. ``` ##### Empty Input **Input:** ``` (empty) ``` **Output:** ``` No words found. ``` ##### Single Word **Input:** ``` hello ``` **Output:** ``` hello (1 / 100.00%) ``` #### Integration with Other Tools ##### Workflow Examples 1. **Process → Analyze → Optimize**: - Text processing → Word Frequency Counter → Content optimization 2. **Compare → Analyze → Report**: - Diff Viewer → Word Frequency Counter → Analysis report 3. **Extract → Count → Sort**: - Data extraction → Word Frequency Counter → Alphabetical Sorter #### Related Tools - **Alphabetical Sorter**: Sort word frequency results alphabetically - **Find & Replace Text**: Remove or modify specific words before analysis - **Case Tool**: Normalize text case before frequency analysis - **Diff Viewer**: Compare word frequency between different texts #### See Also - [Diff Viewer Documentation](#diff-viewer) - [List Comparator Documentation](#list-comparator-widget) - [Analysis & Comparison Tools Overview](#analysis--comparison-tools-2-tools) - [Statistical Analysis Applications](#analysis-applications) ---

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/matbanik/Pomera-AI-Commander'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

analysis-comparison.md•54.6 KiB