# Advanced Features
> Performance monitoring, batch processing, integration capabilities, and advanced configuration options.
---
## Advanced Features
### Async Processing Capabilities
**Module**: `async_text_processor.py`
**Availability**: Optional (enhances performance when available)
**Purpose**: Non-blocking text processing for large documents
#### Description
The Async Processing system provides sophisticated background processing capabilities that prevent UI freezing during heavy text operations. It automatically determines the optimal processing strategy based on content size and provides progress tracking, cancellation support, and intelligent chunking for large documents.
#### Key Features
- **Automatic Mode Detection**: Intelligently selects processing mode based on content size
- **Background Threading**: Non-blocking processing using ThreadPoolExecutor
- **Progress Tracking**: Real-time progress updates for long operations
- **Cancellation Support**: Ability to cancel long-running operations
- **Chunked Processing**: Breaks large texts into manageable chunks
- **Memory Optimization**: Efficient memory usage for large documents
#### Processing Modes
##### 1. Synchronous Mode (SYNC)
- **Content Size**: < 10KB
- **Behavior**: Processes immediately in main thread
- **Use Case**: Small texts that process quickly
- **Performance**: Instant processing, no overhead
##### 2. Asynchronous Mode (ASYNC)
- **Content Size**: 10KB - 100KB
- **Behavior**: Processes in background thread
- **Use Case**: Medium-sized texts that may cause brief UI delays
- **Performance**: Non-blocking UI, single background operation
##### 3. Chunked Mode (CHUNKED)
- **Content Size**: > 100KB
- **Behavior**: Splits text into chunks and processes with progress updates
- **Use Case**: Large documents that require significant processing time
- **Performance**: Progress tracking, memory efficient, cancellable
#### Technical Implementation
##### TextProcessingContext Class
```python
@dataclass
class TextProcessingContext:
"""Context information for text processing operations."""
content: str
content_hash: str
size_bytes: int
line_count: int
processing_mode: ProcessingMode
chunk_size: int = 50000
tool_name: str = ""
callback_id: str = ""
@classmethod
def from_content(cls, content: str, tool_name: str = "", callback_id: str = ""):
"""Create context from text content."""
# Automatic mode detection based on size
if size_bytes < 10000: # 10KB
mode = ProcessingMode.SYNC
elif size_bytes < 100000: # 100KB
mode = ProcessingMode.ASYNC
else:
mode = ProcessingMode.CHUNKED
```
##### AsyncTextProcessor Class
```python
class AsyncTextProcessor:
"""Asynchronous text processor with background threading and chunking support."""
def __init__(self, max_workers: int = 2):
self.executor = ThreadPoolExecutor(max_workers=max_workers)
self.active_tasks: Dict[str, Future] = {}
self.task_callbacks: Dict[str, Callable] = {}
self.progress_callbacks: Dict[str, Callable] = {}
def process_text_async(self, context, processor_func, callback, progress_callback=None):
"""Process text asynchronously with callback when complete."""
# Submit task based on processing mode
if context.processing_mode == ProcessingMode.CHUNKED:
future = self.executor.submit(self._process_chunked, context, processor_func, task_id)
else:
future = self.executor.submit(self._process_single, context, processor_func, task_id)
```
#### Chunking Strategy
##### Intelligent Text Chunking
- **Default Chunk Size**: 50,000 characters
- **Boundary Awareness**: Attempts to break at natural boundaries (lines, sentences)
- **Memory Efficiency**: Processes one chunk at a time to minimize memory usage
- **Progress Tracking**: Reports progress as chunks are completed
##### Chunk Combination
- **Tool-Specific Logic**: Different tools may combine chunks differently
- **Preservation**: Maintains text structure and formatting
- **Error Handling**: Handles partial failures gracefully
#### Performance Benefits
##### UI Responsiveness
- **Non-Blocking**: UI remains responsive during processing
- **Progress Feedback**: Users see real-time progress updates
- **Cancellation**: Users can cancel long operations
- **Resource Management**: Prevents UI freezing and system overload
##### Memory Optimization
- **Chunked Processing**: Large texts processed in manageable pieces
- **Garbage Collection**: Intermediate results cleaned up automatically
- **Memory Monitoring**: Tracks memory usage during processing
- **Resource Limits**: Respects system memory constraints
#### Usage Examples
##### Automatic Processing Mode Selection
```python
# Small text (< 10KB) - Synchronous
context = TextProcessingContext.from_content("Short text here")
# Result: ProcessingMode.SYNC
# Medium text (10KB - 100KB) - Asynchronous
context = TextProcessingContext.from_content(medium_text)
# Result: ProcessingMode.ASYNC
# Large text (> 100KB) - Chunked
context = TextProcessingContext.from_content(large_document)
# Result: ProcessingMode.CHUNKED
```
##### Progress Tracking Example
```python
def progress_callback(current_chunk, total_chunks):
progress_percent = (current_chunk / total_chunks) * 100
print(f"Processing: {progress_percent:.1f}% complete")
processor.process_text_async(
context=context,
processor_func=text_processing_function,
callback=completion_callback,
progress_callback=progress_callback
)
```
#### Integration with Tools
##### Supported Tools
All text processing tools can benefit from async processing:
- **Find & Replace Text**: Large document pattern replacement
- **Email Extraction**: Processing large email archives
- **Word Frequency Counter**: Analyzing large documents
- **Case Tool**: Converting large text files
- **All Other Tools**: Any tool processing large content
##### Automatic Activation
- **Size Detection**: Automatically activates for content > 10KB
- **Tool Integration**: Seamlessly integrated with existing tools
- **Fallback Support**: Gracefully falls back to synchronous processing if unavailable
#### Error Handling and Recovery
##### Cancellation Support
```python
# Cancel specific task
processor.cancel_processing(task_id)
# Cancel all active tasks
processor.cancel_all_tasks()
# Check if task was cancelled
if processor.is_task_cancelled(task_id):
# Handle cancellation
```
##### Error Recovery
- **Partial Results**: Returns partial results when possible
- **Error Reporting**: Detailed error messages and logging
- **Graceful Degradation**: Falls back to synchronous processing on errors
- **Resource Cleanup**: Automatically cleans up resources on errors
#### Performance Monitoring
##### Metrics Collected
- **Processing Time**: Total time for operation completion
- **Chunk Count**: Number of chunks processed
- **Memory Usage**: Peak memory usage during processing
- **Success Rate**: Success/failure statistics
##### Performance Optimization
- **Worker Pool Size**: Configurable number of background workers
- **Chunk Size Tuning**: Adjustable chunk sizes for optimal performance
- **Memory Monitoring**: Automatic memory usage optimization
- **Resource Throttling**: Prevents system overload
#### Best Practices
##### Recommended Usage
- **Large Documents**: Always beneficial for documents > 100KB
- **User Experience**: Provides better experience for any processing > 1 second
- **Resource Management**: Helps manage system resources efficiently
- **Progress Feedback**: Essential for operations taking > 5 seconds
##### Configuration Tips
- **Worker Count**: 2-4 workers optimal for most systems
- **Chunk Size**: 50KB default works well, adjust based on content type
- **Progress Updates**: Update UI every 10-20 chunks for smooth progress
- **Cancellation**: Always provide cancellation option for long operations
##### Common Pitfalls
- **Thread Safety**: Ensure processor functions are thread-safe
- **Memory Leaks**: Properly clean up callbacks and references
- **Error Handling**: Always handle async errors appropriately
- **Resource Limits**: Monitor system resources during heavy processing
#### Advanced Task Management
##### Task Monitoring and Control
```python
# Get active task count
active_count = processor.get_active_task_count()
print(f"Currently processing {active_count} tasks")
# Get detailed task information
task_info = processor.get_active_task_info()
for task_id, info in task_info.items():
print(f"Task {task_id}: {info['tool_name']} - {info['content_size']} bytes")
# Wait for all tasks to complete
completed = processor.wait_for_completion(timeout=30.0)
if not completed:
print("Some tasks are still running after timeout")
```
##### Graceful Shutdown
```python
# Shutdown with waiting for completion
processor.shutdown(wait=True, timeout=10.0)
# Force shutdown without waiting
processor.shutdown(wait=False)
# Global processor shutdown
from core.async_text_processor import shutdown_async_processor
shutdown_async_processor()
```
#### Batch Processing and Resource Management
##### Multiple File Processing Strategy
When processing multiple files or large datasets, the async processor provides several strategies:
**Sequential Processing:**
```python
# Process files one at a time to manage memory
for file_path in file_list:
content = read_file(file_path)
context = TextProcessingContext.from_content(content, tool_name="Batch Process")
# Process with callback
task_id = processor.process_text_async(
context=context,
processor_func=processing_function,
callback=lambda result: handle_file_result(file_path, result)
)
```
**Parallel Processing with Limits:**
```python
# Process multiple files concurrently with resource limits
MAX_CONCURRENT = 3
active_tasks = []
for file_path in file_list:
# Wait if too many active tasks
while len(active_tasks) >= MAX_CONCURRENT:
# Check for completed tasks
active_tasks = [task for task in active_tasks if not task.done()]
time.sleep(0.1)
# Submit new task
context = TextProcessingContext.from_content(content, tool_name="Parallel Batch")
task_id = processor.process_text_async(context, processing_function, callback)
active_tasks.append(task_id)
```
##### Resource Management Best Practices
**Memory Management:**
- **Monitor Memory Usage**: Check system memory before processing large batches
- **Chunk Size Adjustment**: Reduce chunk size for memory-constrained systems
- **Garbage Collection**: Force garbage collection between large operations
- **Memory Limits**: Set processing limits based on available system memory
**CPU Resource Management:**
- **Worker Pool Sizing**: Adjust worker count based on CPU cores (typically cores - 1)
- **Processing Priority**: Use system process priority controls for background processing
- **Thermal Management**: Monitor CPU temperature during intensive operations
- **Load Balancing**: Distribute work across available cores efficiently
**I/O Resource Management:**
- **File Handle Limits**: Manage open file handles when processing many files
- **Disk Space Monitoring**: Check available disk space before large operations
- **Network Resources**: Throttle network operations when processing remote content
- **Temporary File Cleanup**: Clean up temporary files created during processing
##### Batch Processing Examples
**Large Document Collection:**
```python
def process_document_collection(documents, tool_name):
"""Process a collection of documents with progress tracking."""
results = []
total_docs = len(documents)
def document_callback(doc_index, result):
results.append((doc_index, result))
progress = len(results) / total_docs * 100
print(f"Batch progress: {progress:.1f}% ({len(results)}/{total_docs})")
# Process each document
for i, doc_content in enumerate(documents):
context = TextProcessingContext.from_content(
doc_content,
tool_name=tool_name,
callback_id=f"batch_{i}"
)
processor.process_text_async(
context=context,
processor_func=processing_function,
callback=lambda result, idx=i: document_callback(idx, result)
)
# Wait for all to complete
processor.wait_for_completion(timeout=300) # 5 minute timeout
return results
```
**Memory-Efficient Large File Processing:**
```python
def process_large_file_efficiently(file_path, chunk_size=1000000): # 1MB chunks
"""Process very large files by reading in chunks."""
results = []
with open(file_path, 'r', encoding='utf-8') as file:
chunk_num = 0
while True:
chunk = file.read(chunk_size)
if not chunk:
break
context = TextProcessingContext.from_content(
chunk,
tool_name="Large File Processor",
callback_id=f"chunk_{chunk_num}"
)
# Process chunk asynchronously
task_id = processor.process_text_async(
context=context,
processor_func=processing_function,
callback=lambda result: results.append(result)
)
chunk_num += 1
# Limit concurrent chunks to manage memory
if chunk_num % 5 == 0: # Every 5 chunks, wait for completion
processor.wait_for_completion(timeout=60)
# Final wait for all chunks
processor.wait_for_completion()
return combine_chunk_results(results)
```
##### Performance Monitoring for Batch Operations
**Metrics Collection:**
```python
def monitor_batch_performance():
"""Monitor performance during batch operations."""
start_time = time.time()
initial_memory = get_memory_usage()
# Process batch...
# Collect metrics
end_time = time.time()
final_memory = get_memory_usage()
metrics = {
'total_time': end_time - start_time,
'memory_delta': final_memory - initial_memory,
'tasks_processed': processor.get_active_task_count(),
'average_time_per_task': (end_time - start_time) / task_count
}
return metrics
```
**Resource Threshold Management:**
```python
def check_resource_availability():
"""Check if system has sufficient resources for batch processing."""
import psutil
# Check memory
memory = psutil.virtual_memory()
if memory.percent > 80:
return False, "Insufficient memory (>80% used)"
# Check CPU
cpu_percent = psutil.cpu_percent(interval=1)
if cpu_percent > 90:
return False, "High CPU usage (>90%)"
# Check disk space
disk = psutil.disk_usage('/')
if disk.percent > 90:
return False, "Low disk space (>90% used)"
return True, "Resources available"
```
---
### Caching and Optimization Features
**Modules**: `smart_stats_calculator.py`, `content_hash_cache.py`, `regex_pattern_cache.py`
**Availability**: Optional (enhances performance when available)
**Purpose**: Intelligent caching and optimization for improved performance
#### Description
The Caching and Optimization system provides sophisticated caching mechanisms that dramatically improve performance for repeated operations. It includes content-based caching, regex pattern optimization, and intelligent statistics calculation with automatic cache management.
#### Key Components
##### 1. SmartStatsCalculator (`core/smart_stats_calculator.py`)
**Purpose**: Intelligent text statistics calculator with caching and incremental updates
**Core Features:**
- **Content Hash-Based Caching**: Avoids recalculating statistics for unchanged content
- **Incremental Updates**: Efficiently updates statistics for minor text changes
- **Memory Management**: Intelligent cache eviction with configurable memory limits (50MB default)
- **Widget-Aware Caching**: Optimizes cache based on tool switching patterns
- **Advanced Statistics**: Word frequency, reading time, unique word counts
- **Thread-Safe Operations**: Concurrent access support with locking
**Technical Implementation:**
```python
class SmartStatsCalculator:
"""Intelligent text statistics calculator with caching."""
def calculate_stats(self, text: str, widget_id: Optional[str] = None) -> TextStats:
"""Calculate comprehensive text statistics with caching."""
content_hash = self._generate_content_hash(text)
# Check cache first
if content_hash in self.stats_cache:
entry = self.stats_cache[content_hash]
entry.access_count += 1
return entry.stats
# Calculate and cache new stats
stats = self._calculate_stats_impl(text, content_hash)
self._cache_stats(content_hash, stats, len(text), widget_id)
return stats
```
**Statistics Provided:**
- **Basic Counts**: Characters, words, sentences, lines, paragraphs, tokens
- **Advanced Metrics**: Unique words, average word/sentence length, reading time
- **Performance Data**: Calculation time, cache hit status, memory usage
- **Processing Method**: Full calculation, incremental update, or cached result
**Cache Management:**
- **Memory Limits**: 50MB maximum cache size with cleanup at 45MB
- **Eviction Strategy**: LRU + frequency-based intelligent eviction
- **Widget Optimization**: Clears cache when switching tools to free memory
- **Periodic Cleanup**: Background thread removes stale entries every 5 minutes
##### 2. ContentHashCache (`core/content_hash_cache.py`)
**Purpose**: Content hash-based cache for processed text results across all tools
**Core Features:**
- **Tool-Specific Caching**: Different cache settings per tool (priority, TTL)
- **Compression Support**: Optional zlib compression for large results
- **Persistence**: Optional disk persistence across application sessions
- **Intelligent Eviction**: Multi-factor scoring for cache entry value
- **Performance Metrics**: Comprehensive hit rate and timing statistics
**Technical Implementation:**
```python
class ContentHashCache:
"""Intelligent content hash-based cache for processed results."""
def get_cached_result(self, content: str, tool_name: str, tool_settings: Dict) -> Optional[str]:
"""Get cached result for processed content."""
cache_key = self._generate_cache_key(content, tool_name, tool_settings)
if cache_key in self.cache:
result = self.cache[cache_key]
if self._is_result_valid(result, tool_name):
# Update access statistics and return result
result.access_count += 1
result.last_access = time.time()
self.cache.move_to_end(cache_key) # LRU update
return result.content
return None # Cache miss
```
**Tool-Specific Settings:**
- **High Priority Tools**: Case Tool, URL Extractor, Sorters (48-hour TTL)
- **Medium Priority Tools**: Find & Replace, Word Frequency (12-24 hour TTL)
- **Low Priority Tools**: Encoders/Decoders (6-hour TTL)
- **Configurable Limits**: 50MB memory limit, 1000 entry limit
**Cache Value Scoring:**
```python
def _evict_least_valuable_entry(self):
"""Evict using multi-factor scoring algorithm."""
for cache_key, result in self.cache.items():
# Factors: recency, frequency, processing time saved, tool priority, size
recency_score = 1.0 / max(result.age_seconds / 3600, 0.1)
frequency_score = result.access_count / max(result.age_seconds / 3600, 0.1)
time_saved_score = result.processing_time_ms / 100.0
priority_multiplier = {'high': 3.0, 'medium': 2.0, 'low': 1.0}[tool_priority]
size_penalty = result.size_estimate / (1024 * 1024)
score = (recency_score * 0.3 + frequency_score * 0.4 + time_saved_score * 0.2) * priority_multiplier - size_penalty * 0.1
```
##### 3. RegexPatternCache (`core/regex_pattern_cache.py`)
**Purpose**: Caches compiled regex patterns to eliminate compilation overhead
**Core Features:**
- **Pattern Compilation Caching**: Stores compiled regex objects
- **Flag-Aware Caching**: Different cache entries for different regex flags
- **Automatic Cleanup**: Removes unused patterns to manage memory
- **Performance Monitoring**: Tracks pattern usage and compilation savings
#### Caching Strategies
##### Multi-Level Caching Architecture
The application employs a sophisticated multi-level caching strategy:
**Level 1: Statistics Caching (SmartStatsCalculator)**
- Caches text analysis results (word count, line count, etc.)
- Content hash-based with incremental update support
- Widget-aware cache management for tool switching optimization
- Memory-efficient with intelligent eviction
**Level 2: Processing Result Caching (ContentHashCache)**
- Caches final processed results from all tools
- Tool-specific cache policies and TTL settings
- Compression support for large results
- Persistence across application sessions
**Level 3: Pattern Caching (RegexPatternCache)**
- Caches compiled regex patterns
- Eliminates regex compilation overhead
- Automatic pattern optimization
##### Intelligent Cache Key Generation
```python
def _generate_cache_key(self, content: str, tool_name: str, tool_settings: Dict) -> str:
"""Generate unique cache key considering all processing parameters."""
content_hash = hashlib.md5(content.encode('utf-8')).hexdigest()[:16]
settings_str = str(sorted(tool_settings.items()))
key_data = f"{tool_name}_{content_hash}_{settings_str}"
return hashlib.sha256(key_data.encode('utf-8')).hexdigest()[:32]
```
**Key Components:**
- **Content Hash**: MD5 hash of text content for uniqueness
- **Tool Name**: Ensures tool-specific caching
- **Settings Hash**: Includes all tool configuration parameters
- **Collision Avoidance**: SHA256 final hash prevents key collisions
##### Incremental Update Strategy
```python
def calculate_stats_incremental(self, text: str, previous_stats: TextStats, change_info: ChangeInfo) -> TextStats:
"""Efficiently update statistics for minor text changes."""
if not change_info.is_minor_change():
return self.calculate_stats(text) # Fall back to full calculation
# Update statistics incrementally
stats = previous_stats.copy()
if change_info.change_type == "insert":
stats.char_count += len(change_info.inserted_text)
stats.line_count += change_info.inserted_text.count('\n')
stats.word_count += len(change_info.inserted_text.split())
return stats
```
**Incremental Update Benefits:**
- **Performance**: 10-100x faster for minor changes
- **Real-time Updates**: Instant statistics updates during typing
- **Memory Efficient**: Avoids full text reprocessing
- **Fallback Safety**: Automatically falls back to full calculation when needed
##### Cache Warming and Precomputation
```python
def precompute_stats(self, texts: List[str], widget_ids: Optional[List[str]] = None):
"""Precompute statistics for a list of texts (background processing)."""
for i, text in enumerate(texts):
widget_id = widget_ids[i] if widget_ids and i < len(widget_ids) else None
self.calculate_stats(text, widget_id)
```
**Use Cases:**
- **Application Startup**: Precompute common patterns and statistics
- **Tool Switching**: Warm cache for frequently used tools
- **Batch Processing**: Precompute results for known datasets
#### Performance Benefits
##### Cache Hit Rates by Tool Category
**Text Transformation Tools:**
- **Case Tool**: 85-95% hit rate (high repetition of common transformations)
- **Find & Replace**: 60-75% hit rate (varies by pattern complexity)
- **Sorters**: 90-95% hit rate (deterministic results for same input)
**Data Extraction Tools:**
- **Email Extraction**: 70-85% hit rate (common email formats)
- **URL Extraction**: 80-90% hit rate (repeated URL patterns)
- **Header Analysis**: 65-80% hit rate (similar email structures)
**Analysis Tools:**
- **Word Frequency**: 75-90% hit rate (repeated document analysis)
- **Statistics Calculation**: 85-95% hit rate (frequent status bar updates)
- **Diff Viewer**: 50-70% hit rate (depends on comparison patterns)
##### Performance Improvements by Operation Type
**Statistics Calculation:**
```python
# Without caching: 50-200ms for large documents
# With caching: 0.1-1ms for cache hits
# Improvement: 50-2000x faster
```
**Text Processing:**
```python
# Without caching: 100-500ms for complex operations
# With caching: 0.5-2ms for cache hits
# Improvement: 200-1000x faster
```
**Regex Operations:**
```python
# Without caching: 10-50ms for pattern compilation + matching
# With caching: 1-5ms for matching only
# Improvement: 10-50x faster
```
##### Memory Usage Optimization
**Smart Memory Management:**
- **Adaptive Sizing**: Cache size adjusts based on available system memory
- **Compression**: Large results compressed using zlib (30-70% size reduction)
- **Lazy Loading**: Cache entries loaded only when accessed
- **Memory Monitoring**: Continuous monitoring prevents memory exhaustion
**Memory Efficiency Metrics:**
```python
def get_cache_stats(self) -> Dict[str, Any]:
"""Comprehensive cache performance metrics."""
return {
'memory_usage_mb': self.current_memory_usage / (1024 * 1024),
'memory_efficiency': self.cache_hits / max(self.current_memory_usage, 1),
'compression_ratio': self.compressed_size / self.uncompressed_size,
'eviction_rate': self.evictions_per_hour
}
```
#### Advanced Memory Management
##### Intelligent Eviction Algorithms
**Multi-Factor Scoring System:**
```python
def calculate_eviction_score(self, entry: CacheEntry) -> float:
"""Calculate eviction score using multiple factors."""
# Recency factor (0.0-1.0)
recency_score = 1.0 / max(entry.age_seconds / 3600, 0.1)
# Frequency factor (accesses per hour)
frequency_score = entry.access_count / max(entry.age_seconds / 3600, 0.1)
# Processing time saved factor
time_saved_score = entry.processing_time_ms / 100.0
# Tool priority multiplier
priority_multiplier = self.tool_priorities.get(entry.tool_name, 2.0)
# Size penalty (larger entries less valuable)
size_penalty = entry.size_estimate / (1024 * 1024)
# Combined score (higher = more valuable, less likely to evict)
return (recency_score * 0.3 + frequency_score * 0.4 + time_saved_score * 0.2) * priority_multiplier - size_penalty * 0.1
```
**Eviction Strategies:**
- **LRU + Frequency**: Combines recency and access frequency
- **Size-Aware**: Considers memory footprint in eviction decisions
- **Tool Priority**: Prioritizes cache entries for high-value tools
- **Processing Time**: Retains entries that save significant processing time
##### Memory Pressure Handling
**Proactive Memory Management:**
```python
def _evict_by_memory_pressure(self):
"""Evict entries when memory usage exceeds limits."""
target_memory = self.CLEANUP_THRESHOLD_BYTES # 45MB
memory_to_free = self.current_memory_usage - target_memory
# Sort entries by eviction priority
entries_by_priority = sorted(
self.cache.items(),
key=lambda x: self.calculate_eviction_score(x[1]),
reverse=True # Highest score first (least likely to evict)
)
# Evict lowest priority entries until memory target reached
freed_memory = 0
for cache_key, entry in reversed(entries_by_priority):
if freed_memory >= memory_to_free:
break
self.cache.pop(cache_key)
freed_memory += entry.memory_usage
```
**Memory Monitoring:**
- **Continuous Tracking**: Real-time memory usage monitoring
- **Threshold Alerts**: Warnings when approaching memory limits
- **Automatic Cleanup**: Proactive cleanup before memory exhaustion
- **System Integration**: Considers overall system memory availability
##### Cache Persistence and Recovery
**Disk Persistence:**
```python
def _save_cache_to_disk(self):
"""Save cache to disk for persistence across sessions."""
cache_data = {
'cache': {k: v for k, v in self.cache.items() if v.age_seconds < 24 * 3600},
'metrics': self.metrics,
'timestamp': time.time()
}
with open(self.cache_file, 'wb') as f:
pickle.dump(cache_data, f)
```
**Recovery Features:**
- **Session Persistence**: Cache survives application restarts
- **Corruption Recovery**: Graceful handling of corrupted cache files
- **Version Compatibility**: Handles cache format changes
- **Selective Loading**: Loads only recent, valid cache entries
#### Cache Performance Monitoring
##### Real-Time Metrics
**Performance Dashboard:**
```python
def get_comprehensive_stats(self) -> Dict[str, Any]:
"""Get detailed cache performance statistics."""
return {
'hit_rate_percent': self.metrics.hit_rate,
'average_response_time_ms': self.get_average_response_time(),
'memory_efficiency_score': self.calculate_memory_efficiency(),
'cache_effectiveness': self.calculate_cache_effectiveness(),
'tool_performance': self.get_per_tool_performance(),
'eviction_statistics': self.get_eviction_stats(),
'compression_statistics': self.get_compression_stats()
}
```
**Key Performance Indicators:**
- **Hit Rate**: Percentage of requests served from cache
- **Response Time**: Average time to retrieve cached results
- **Memory Efficiency**: Cache hits per MB of memory used
- **Eviction Rate**: Frequency of cache entry evictions
- **Compression Ratio**: Space saved through compression
##### Optimization Recommendations
**Automatic Optimization:**
```python
def optimize_cache_configuration(self):
"""Automatically optimize cache settings based on usage patterns."""
stats = self.get_cache_stats()
# Adjust cache size based on hit rate
if stats['hit_rate_percent'] > 90 and stats['memory_usage_percent'] > 80:
# High hit rate but near memory limit - consider increasing cache size
self.recommend_cache_size_increase()
elif stats['hit_rate_percent'] < 50:
# Low hit rate - analyze cache effectiveness
self.analyze_cache_effectiveness()
# Optimize tool-specific settings
self.optimize_tool_cache_settings()
```
**Performance Tuning Guidelines:**
- **Cache Size**: Increase for high hit rates, decrease for low hit rates
- **TTL Settings**: Adjust based on content change frequency
- **Compression**: Enable for large results, disable for small results
- **Tool Priorities**: Adjust based on actual usage patterns
---
### Regex Optimization and Pattern Library
**Modules**: `optimized_search_highlighter.py`, `regex_pattern_cache.py`, `regex_pattern_library.py`
**Availability**: Always available (core performance enhancement)
**Purpose**: Advanced regex optimization, pattern caching, and comprehensive pattern library
#### Description
The Regex Optimization system provides sophisticated pattern matching capabilities with intelligent caching, progressive highlighting, and a comprehensive library of 20 pre-built regex patterns. The system dramatically improves performance for regex operations while providing advanced search highlighting with non-blocking UI updates.
#### Core Components
##### 1. OptimizedSearchHighlighter (`core/optimized_search_highlighter.py`)
**Purpose**: High-performance search and highlighting with progressive updates and non-blocking operations
**Key Features:**
- **Multiple Highlighting Modes**: Immediate, Progressive, Batch, and Lazy highlighting
- **Background Processing**: Non-blocking search operations using worker threads
- **Progress Tracking**: Real-time progress updates for long operations
- **Cancellation Support**: Ability to cancel long-running search operations
- **Performance Monitoring**: Comprehensive statistics and performance tracking
**Highlighting Modes:**
```python
class HighlightMode(Enum):
IMMEDIATE = "immediate" # Highlight all matches immediately
PROGRESSIVE = "progressive" # Highlight matches progressively with UI updates
BATCH = "batch" # Highlight in batches with controlled processing
LAZY = "lazy" # Highlight only visible area (viewport optimization)
```
**Technical Implementation:**
```python
def search_and_highlight(self, text_widget: tk.Text, pattern: str, tag_name: str = 'search_highlight',
mode: HighlightMode = HighlightMode.PROGRESSIVE, flags: int = 0,
batch_size: Optional[int] = None, max_matches: int = 10000,
progress_callback: Optional[Callable] = None) -> str:
"""Start optimized search and highlight operation."""
# Create search operation with intelligent mode selection
operation = SearchOperation(
pattern=pattern, text_widget=text_widget, tag_name=tag_name,
mode=mode, batch_size=batch_size or self.default_batch_size,
max_matches=max_matches, timeout_ms=self.highlight_timeout_ms
)
# Process in background thread
self.operation_queue.put(operation)
return operation_id
```
**Performance Benefits:**
- **Non-Blocking UI**: Search operations don't freeze the interface
- **Progressive Updates**: Users see results as they're found
- **Memory Efficient**: Processes large texts without memory exhaustion
- **Cancellable Operations**: Long searches can be cancelled by users
##### 2. RegexPatternCache (`core/regex_pattern_cache.py`)
**Purpose**: Intelligent caching system for compiled regex patterns with usage tracking
**Core Features:**
- **Pattern Compilation Caching**: Eliminates repeated regex compilation overhead
- **Usage Statistics**: Tracks access patterns, success rates, and performance metrics
- **Intelligent Eviction**: LRU-based cache management with usage-aware eviction
- **Pattern Optimization**: Automatic pattern optimization based on type detection
**Technical Implementation:**
```python
class RegexPatternCache:
"""Intelligent regex pattern cache with optimization."""
def get_compiled_pattern(self, pattern_string: str, flags: int = 0, pattern_type: str = "regex") -> Optional[Pattern[str]]:
"""Get compiled pattern with caching and optimization."""
cache_key = self._generate_cache_key(pattern_string, flags, pattern_type)
# Check cache first
if cache_key in self.pattern_cache:
entry = self.pattern_cache[cache_key]
entry.access_count += 1
entry.last_access = time.time()
return entry.pattern
# Compile with optimizations
optimized_pattern = self._optimize_pattern(pattern_string, pattern_type)
compiled_pattern = re.compile(optimized_pattern, flags)
# Cache the result
self._cache_pattern(cache_key, PatternCacheEntry(
pattern=compiled_pattern, pattern_string=pattern_string,
flags=flags, compilation_time_ms=compilation_time
))
return compiled_pattern
```
**Pattern Optimizations:**
- **Simple Text Detection**: Automatically escapes non-regex text searches
- **Word Boundary Optimization**: Adds word boundaries for simple word searches
- **Wildcard Conversion**: Converts wildcard patterns to proper regex
- **Performance Monitoring**: Tracks compilation time and usage patterns
**Cache Performance:**
- **Hit Rate**: 85-95% for typical usage patterns
- **Compilation Savings**: 10-50x faster for cached patterns
- **Memory Efficiency**: Intelligent eviction prevents memory bloat
- **Usage Tracking**: Detailed statistics for optimization
##### 3. RegexPatternLibrary (`core/regex_pattern_library.py`)
**Purpose**: Comprehensive library of 20 pre-built, tested regex patterns for common tasks
**Pattern Categories:**
**Data Validation Patterns (8 patterns):**
1. **Email Address Validation**: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
2. **Password Strength**: `^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$`
3. **Phone Number (North American)**: `^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$`
4. **URL Validation**: `^https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%.\_\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$`
5. **Username Format**: `^[a-zA-Z0-9]([._-](?![._-])|[a-zA-Z0-9]){3,18}[a-zA-Z0-9]$`
6. **IPv4 Address**: `^((25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?)$`
7. **Date Format (YYYY-MM-DD)**: `^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$`
8. **Credit Card Numbers**: `^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})$`
**Information Extraction Patterns (7 patterns):**
1. **Extract Email Addresses**: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
2. **Extract URLs**: `https?:\/\/[^\s/$.?#].[^\s]*`
3. **Extract Hashtags**: `(?<=\s|^)#(\w+)`
4. **Extract @Mentions**: `(?<=\s|^)@(\w{1,15})\b`
5. **Log File Parsing**: `^(?P<ip>[\d.]+) (?P<identd>\S+) (?P<user>\S+) \[(?P<timestamp>.*?)\] \"(?P<request>.*?)\" (?P<status_code>\d{3}) (?P<size>\d+|-).*$`
6. **CSV Field Parsing**: `(?:^|,)("(?:[^"]|"")*"|[^,]*)`
7. **HTML Tag Content**: `<h1.*?>(.*?)<\/h1>`
**Text Cleaning Patterns (5 patterns):**
1. **Strip HTML Tags**: `<[^<]+?>`
2. **Remove Duplicate Words**: `\b(\w+)\s+\1\b`
3. **Trim Whitespace**: `^\s+|\s+$`
4. **Normalize Phone Numbers**: `^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$`
5. **Mask Sensitive Data**: `\b(\d{4}[- ]?){3}(\d{4})\b`
**Library Usage:**
```python
from core.regex_pattern_library import RegexPatternLibrary
library = RegexPatternLibrary()
# Get all patterns
all_patterns = library.get_all_patterns()
# Get patterns by category
validation_patterns = library.get_validation_patterns()
extraction_patterns = library.get_extraction_patterns()
cleaning_patterns = library.get_cleaning_patterns()
# Search patterns by purpose
email_patterns = library.get_pattern_by_purpose("email")
# Update settings.json with patterns
library.update_settings_file("settings.json")
```
#### Advanced Search Operations
##### Progressive Highlighting Algorithm
```python
def _find_matches_progressive(self, operation: SearchOperation, pattern: re.Pattern, content: str):
"""Find matches progressively with periodic UI updates."""
matches = []
batch_matches = []
last_update_time = time.time()
update_interval = 0.1 # Update UI every 100ms
for match in pattern.finditer(content):
if operation.state == SearchState.CANCELLED:
break
highlight_match = HighlightMatch(
start=match.start(), end=match.end(),
text=match.group(), tag_name=operation.tag_name
)
matches.append(highlight_match)
batch_matches.append(highlight_match)
# Apply highlights in batches for smooth UI updates
if (len(batch_matches) >= operation.batch_size or
time.time() - last_update_time > update_interval):
self._apply_highlights_batch(operation, batch_matches)
batch_matches = []
last_update_time = time.time()
# Update progress
if operation.progress_callback:
operation.progress_callback(operation)
```
##### Lazy Loading for Large Documents
```python
def _find_matches_lazy(self, operation: SearchOperation, pattern: re.Pattern, content: str):
"""Find matches only in visible area (viewport optimization)."""
# Get visible area of text widget
visible_start = operation.text_widget.index("@0,0")
visible_end = operation.text_widget.index(f"@{width},{height}")
# Extract visible content
start_idx = operation.text_widget.count("1.0", visible_start, "chars")[0]
end_idx = operation.text_widget.count("1.0", visible_end, "chars")[0]
visible_content = content[start_idx:end_idx]
# Process only visible content
for match in pattern.finditer(visible_content):
highlight_match = HighlightMatch(
start=start_idx + match.start(),
end=start_idx + match.end(),
text=match.group(),
tag_name=operation.tag_name
)
matches.append(highlight_match)
```
#### Performance Monitoring and Statistics
##### Search Operation Metrics
```python
@dataclass
class SearchProgress:
"""Comprehensive progress tracking for search operations."""
total_chars: int = 0
processed_chars: int = 0
matches_found: int = 0
batches_completed: int = 0
time_elapsed: float = 0.0
estimated_remaining: float = 0.0
@property
def progress_percent(self) -> float:
return (self.processed_chars / max(self.total_chars, 1)) * 100
```
##### Performance Statistics
```python
def get_performance_stats(self) -> Dict[str, Any]:
"""Get comprehensive performance statistics."""
return {
'total_operations': self.performance_stats['total_operations'],
'completed_operations': self.performance_stats['completed_operations'],
'cancelled_operations': self.performance_stats['cancelled_operations'],
'average_processing_time': self.performance_stats['average_processing_time'],
'total_matches_found': self.performance_stats['total_matches_found'],
'cache_hit_rate': self.pattern_cache.get_cache_stats()['hit_rate'],
'active_operations': len(self.active_operations)
}
```
#### Integration with Find & Replace Tool
##### Enhanced Find & Replace Operations
```python
class FindReplaceCache:
"""Specialized cache for find/replace operations."""
def find_with_cache(self, find_text: str, content: str, options: Dict[str, Any]) -> Dict[str, Any]:
"""Perform find operation with comprehensive caching."""
# Check result cache first
operation_key = self._generate_operation_key(find_text, content, options, "find")
if operation_key in self.result_cache:
return self.result_cache[operation_key]
# Use pattern cache for compilation
pattern_type, flags = self._parse_options(options)
search_result = self.pattern_cache.search_with_cache(find_text, content, flags, pattern_type)
# Cache and return results
result = {
'matches': search_result.matches,
'match_count': search_result.match_count,
'search_time_ms': search_result.search_time_ms,
'cache_hit': False
}
self._cache_result(operation_key, result)
return result
```
#### Best Practices and Usage Guidelines
##### Optimal Pattern Usage
**For Simple Text Searches:**
- Use `pattern_type="text"` for literal text matching
- Automatic escaping prevents regex interpretation errors
- Fastest performance for non-regex searches
**For Complex Patterns:**
- Use `pattern_type="regex"` for full regex functionality
- Leverage pattern library for common use cases
- Test patterns with small datasets first
**For Large Documents:**
- Use `HighlightMode.PROGRESSIVE` for responsive UI
- Set appropriate `batch_size` (100-500 matches per batch)
- Enable progress callbacks for user feedback
- Consider `HighlightMode.LAZY` for very large documents
##### Performance Optimization Tips
**Pattern Design:**
- Use specific patterns rather than overly broad ones
- Avoid excessive backtracking in regex patterns
- Leverage word boundaries (`\b`) for word searches
- Use non-capturing groups `(?:...)` when possible
**Cache Management:**
- Patterns are automatically cached and optimized
- Cache hit rates of 85-95% are typical
- Monitor cache statistics for optimization opportunities
- Clear cache periodically for long-running sessions
**Memory Management:**
- Large document highlighting uses chunked processing
- Lazy mode processes only visible content
- Automatic cleanup prevents memory leaks
- Monitor active operations for resource usage
This comprehensive regex optimization system provides powerful, efficient pattern matching capabilities while maintaining excellent performance and user experience even with large documents and complex patterns.
This comprehensive advanced features documentation covers the sophisticated performance optimization and enhancement systems available in Pomera AI Commander, providing users with powerful tools for efficient text processing at scale.---