FGD Fusion Stack Pro

CRASH_ANALYSIS_AND_FIXES.md•13.5 kB

# MCPM Crash Analysis and Fixes **Analysis Date**: 2025-11-06 **Severity Levels**: 🔴 CRITICAL | 🟠 HIGH | 🟡 MEDIUM | 🟢 LOW --- ## Executive Summary Analysis of the MCPM codebase revealed **7 critical crash-prone issues** that can cause: - Application hangs/freezes - Zombie processes - Deadlocks - Resource leaks - Unclean shutdowns All issues have been identified with specific locations and recommended fixes. --- ## 🔴 CRITICAL Issues ### 1. **Subprocess Pipe Deadlock** (CRITICAL) **Location**: `gui_main_pro.py:1289-1317` **Problem**: ```python def _read_subprocess_stdout(self): for line in self.process.stdout: # ← BLOCKS INDEFINITELY # Process line... ``` **Why This Crashes**: - If subprocess doesn't produce output, thread blocks forever - If subprocess crashes, pipe may never close - Daemon threads blocking on I/O can cause ugly exits - Can cause GUI to hang waiting for threads to finish **Impact**: 🔴 **Application hang**, zombie processes **Fix**: ```python def _read_subprocess_stdout(self): """Background thread to read subprocess stdout.""" try: while self.process and self.process.poll() is None: line = self.process.stdout.readline() if not line: break try: decoded = line.decode('utf-8', errors='replace') if self.log_file: with self._log_lock: # Thread-safe file writes with open(self.log_file, 'a') as f: f.write(decoded) f.flush() except Exception as e: logger.error(f"Error writing stdout to log: {e}") break except Exception as e: logger.debug(f"Stdout reader stopped: {e}") ``` **Same issue exists for stderr reader at line 1304-1317.** --- ### 2. **Process Termination Without Cleanup** (CRITICAL) **Location**: `gui_main_pro.py:1321` **Problem**: ```python def toggle_server(self): if self.process and self.process.poll() is None: self.process.terminate() # ← NO WAIT! self.process = None # ← Immediate cleanup ``` **Why This Crashes**: - Sets `self.process = None` immediately after terminate - Daemon threads still reading from pipes crash when process is None - No `.wait()` call = zombie process - Race condition: threads may try to read from None process **Impact**: 🔴 **Zombie processes**, thread crashes, resource leaks **Fix**: ```python def toggle_server(self): if self.process and self.process.poll() is None: logger.info("Stopping MCP backend process...") self.process.terminate() try: # Wait up to 5 seconds for clean shutdown self.process.wait(timeout=5) logger.info("Process terminated cleanly") except subprocess.TimeoutExpired: logger.warning("Process didn't stop, forcing kill...") self.process.kill() self.process.wait() # Wait for kill to complete # Now safe to set to None self.process = None self.connection_status.set_status("stopped", "🔴 Server stopped") self.start_btn.setText("▶ Start Server") ``` --- ### 3. **Infinite Approval Monitor Loop** (CRITICAL) **Location**: `mcp_backend.py:293-303` **Problem**: ```python async def _approval_monitor_loop(self): while True: # ← INFINITE LOOP, NO CANCELLATION try: await asyncio.sleep(2) # Process approvals... except Exception as e: logger.debug(f"Approval monitor error: {e}") ``` **Why This Crashes**: - Infinite loop with no exit condition - If task isn't properly cancelled on shutdown, hangs forever - Exception handling catches ALL exceptions (including cancellation) - No reference to task = can't cancel it **Impact**: 🔴 **Application hang on exit**, resource leak **Fix**: ```python async def _approval_monitor_loop(self): """Background loop to check for approval files and auto-apply edits.""" try: while True: await asyncio.sleep(2) approval_file = self.watch_dir / ".fgd_approval.json" if not approval_file.exists(): continue # Process approvals... except asyncio.CancelledError: logger.info("Approval monitor cancelled, shutting down cleanly") raise # Re-raise to allow proper cancellation except Exception as e: logger.error(f"Approval monitor error: {e}", exc_info=True) ``` **AND store the task reference:** ```python def __init__(self, config_path: str): # ...existing init code... self._approval_task = None # Track the task async def run(self): logger.info("MCP Server starting...") # Start approval monitor and STORE reference self._approval_task = asyncio.create_task(self._approval_monitor_loop()) try: async with stdio_server() as (read, write): await self.server.run(read, write, self.server.create_initialization_options()) finally: # Clean shutdown if self._approval_task and not self._approval_task.done(): self._approval_task.cancel() try: await self._approval_task except asyncio.CancelledError: pass ``` --- ### 4. **Observer Thread Join Without Timeout** (HIGH) **Location**: `mcp_backend.py:759-762` **Problem**: ```python def stop(self): if self.observer: self.observer.stop() self.observer.join() # ← CAN HANG FOREVER ``` **Why This Crashes**: - `join()` without timeout can hang indefinitely - If observer thread is stuck, application freezes on exit - No error handling if join fails **Impact**: 🟠 **Application freeze on exit** **Fix**: ```python def stop(self): if self.observer: logger.info("Stopping file watcher...") self.observer.stop() # Join with timeout to prevent hanging self.observer.join(timeout=5.0) if self.observer.is_alive(): logger.warning("File watcher thread did not stop cleanly") else: logger.info("File watcher stopped cleanly") self.observer = None ``` --- ## 🟠 HIGH Priority Issues ### 5. **File I/O Race Condition** (HIGH) **Location**: `gui_main_pro.py:1296, 1311` **Problem**: - Two daemon threads writing to same log file - No file locking or synchronization - Concurrent writes can corrupt log file or cause IOError **Impact**: 🟠 **Corrupted logs**, potential crashes **Fix**: ```python import threading class FGDGUI(QWidget): def __init__(self): # ...existing init code... self._log_lock = threading.Lock() def _read_subprocess_stdout(self): try: while self.process and self.process.poll() is None: line = self.process.stdout.readline() if not line: break decoded = line.decode('utf-8', errors='replace') if self.log_file: with self._log_lock: # Thread-safe with open(self.log_file, 'a') as f: f.write(decoded) f.flush() except Exception as e: logger.debug(f"Stdout reader stopped: {e}") ``` --- ### 6. **Process Object Race Condition** (HIGH) **Location**: `gui_main_pro.py:1292, 1320` **Problem**: ```python # Thread 1 (daemon): for line in self.process.stdout: # Reading process # Thread 2 (main): self.process.terminate() self.process = None # ← RACE: Thread 1 crashes! ``` **Impact**: 🟠 **Thread crashes**, attribute errors **Fix**: ```python def toggle_server(self): if self.process and self.process.poll() is None: # Store reference before setting to None process_to_stop = self.process logger.info("Stopping MCP backend process...") process_to_stop.terminate() try: process_to_stop.wait(timeout=5) logger.info("Process terminated cleanly") except subprocess.TimeoutExpired: logger.warning("Process didn't stop, forcing kill...") process_to_stop.kill() process_to_stop.wait() # Now safe to set to None (daemon threads will exit on pipe close) self.process = None ``` --- ### 7. **Missing Subprocess Cleanup on Close** (HIGH) **Location**: `gui_main_pro.py:1689-1702` **Problem**: ```python def closeEvent(self, event): if self.process: self.process.terminate() try: self.process.wait(timeout=2) except Exception: pass # ← Silent failure, no kill ``` **Why This Is Bad**: - If process doesn't stop in 2 seconds, leaves zombie - No `.kill()` fallback - Daemon threads may still be blocked on pipes **Impact**: 🟠 **Zombie processes** on application exit **Fix**: ```python def closeEvent(self, event): """Clean shutdown of all resources.""" logger.info("Application closing, cleaning up...") # Stop timers if hasattr(self, "timer") and self.timer.isActive(): self.timer.stop() if hasattr(self, "_header_timer") and self._header_timer.isActive(): self._header_timer.stop() # Stop subprocess with proper cleanup if self.process: logger.info("Terminating MCP backend...") self.process.terminate() try: self.process.wait(timeout=5) logger.info("Backend stopped cleanly") except subprocess.TimeoutExpired: logger.warning("Backend didn't stop, forcing kill...") self.process.kill() try: self.process.wait(timeout=2) except subprocess.TimeoutExpired: logger.error("Backend refused to die!") except Exception as e: logger.error(f"Error stopping backend: {e}") # Close pop-out windows for window in self.pop_out_windows: try: window.close() except Exception as e: logger.debug(f"Error closing pop-out window: {e}") event.accept() logger.info("Application closed") ``` --- ## 🟡 MEDIUM Priority Issues ### 8. **No Thread Cleanup Tracking** (MEDIUM) **Location**: `gui_main_pro.py:1396-1399` **Problem**: - Daemon threads started but never tracked - No way to check if they're still running - No graceful shutdown mechanism **Fix**: Add thread tracking and shutdown signals: ```python def __init__(self): # ...existing init code... self._shutdown_event = threading.Event() self._stdout_thread = None self._stderr_thread = None def _read_subprocess_stdout(self): try: while not self._shutdown_event.is_set(): if not self.process or self.process.poll() is not None: break # Read with timeout to allow checking shutdown flag # ... (implementation using select or similar) except Exception as e: logger.debug(f"Stdout reader stopped: {e}") def stop(self): # Signal threads to stop self._shutdown_event.set() # Stop process # ... (existing stop code) ``` --- ## Testing Recommendations ### Crash Testing Scenarios 1. **Subprocess Crash Test**: ```bash # Start GUI, then kill backend manually ps aux | grep mcp_backend kill -9 <PID> # GUI should handle gracefully ``` 2. **Rapid Start/Stop Test**: ```python # Click Start → Stop → Start → Stop rapidly # Should not crash or leave zombies ``` 3. **Force Quit Test**: ```bash # Close GUI window during operation # Check for zombie processes: ps aux | grep defunct ``` 4. **Long-Running Test**: ```bash # Run for hours, then close # Check memory usage doesn't grow # Verify clean shutdown ``` --- ## Implementation Priority ### Phase 1 - Critical Fixes (Do First) 1. ✅ Fix subprocess pipe deadlock (Issue #1) 2. ✅ Fix process termination cleanup (Issue #2) 3. ✅ Fix infinite approval monitor loop (Issue #3) ### Phase 2 - High Priority 4. ✅ Add observer join timeout (Issue #4) 5. ✅ Add file I/O locking (Issue #5) 6. ✅ Fix process race condition (Issue #6) ### Phase 3 - Medium Priority 7. ✅ Improve closeEvent cleanup (Issue #7) 8. ✅ Add thread tracking (Issue #8) --- ## Monitoring and Prevention ### Add Crash Logging ```python import atexit def _cleanup_on_exit(): """Emergency cleanup on abnormal exit.""" logger.critical("Emergency cleanup triggered!") # Log process state, thread count, etc. atexit.register(_cleanup_on_exit) ``` ### Add Health Checks ```python def _health_check(self): """Periodic health check to detect hung state.""" if self.process: if self.process.poll() is not None: # Process died unexpectedly! logger.error("Backend process died unexpectedly!") self.connection_status.set_status("error", "🔴 Backend crashed") ``` --- ## Summary **Total Issues Found**: 8 **Critical**: 4 🔴 **High**: 3 🟠 **Medium**: 1 🟡 **Primary Crash Causes**: 1. Blocking I/O on subprocess pipes 2. Missing process cleanup 3. Uncancellable async tasks 4. Thread joins without timeouts 5. Race conditions on shared resources **Estimated Fix Time**: 4-6 hours **Testing Time**: 2-3 hours All issues are fixable with the provided solutions. Priority should be given to Critical issues as they cause immediate application hangs and crashes.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/mikeychann-hash/MCPM'

If you have feedback or need assistance with the MCP directory API, please join our Discord server