Skip to main content
Glama
irskep

persistproc

by irskep
process_testing_issues.md6.3 kB
# Debugging Process Termination Failures in E2E Tests This document outlines the investigation into a persistent issue where `pytest` end-to-end tests for the `persistproc run` command would hang, eventually failing with a `TimeoutExpired` error. The core of the problem was that child processes spawned during the tests were not being correctly terminated during test cleanup. ## 1. Initial Problem & Test-Side Fixes The initial problem was observed in two new tests, `test_run_kills_process_on_exit` and `test_run_detach_keeps_process_running`, which use `persistproc run` to manage a long-running script (`tests/scripts/counter.py`). The tests would consistently time out because the `counter.py` process was not being terminated when the test finished. Our first step was to improve the test suite's process management capabilities: 1. **Process Groups (`start_new_session`):** We modified our test helpers (`tests/helpers.py`) to launch both the main `persistproc` server and the `persistproc run` command using `subprocess.Popen` with the `start_new_session=True` argument. This ensures that each subprocess is the leader of a new process group. 2. **Robust Cleanup with `os.killpg`:** We created a new helper function, `stop_run`, which sends signals to the entire process group using `os.killpg(os.getpgid(proc.pid), signal.SIGINT)`. This function implements a robust, staged termination strategy: * It first sends `SIGINT` (graceful shutdown). * It waits and polls for the process to exit. * If the process is still alive, it sends `SIGTERM`. * As a last resort, it sends `SIGKILL`. This approach was also integrated into the `persistproc_server` fixture in `tests/conftest.py` to ensure reliable server cleanup. ## 2. Identifying the Root Cause: The "Double `setsid`" Problem Despite the robust test-side cleanup, the tests continued to hang. Further investigation revealed a "double `setsid`" issue in the application logic itself. The sequence of events was as follows: 1. The test suite starts `persistproc run` in a new process group (**PGID_A**). 2. `persistproc run` connects to the `persistproc` server and requests that it start the `counter.py` script. 3. The server's `ProcessManager` in `persistproc/process_manager.py` would then start `counter.py` using `subprocess.Popen` with `preexec_fn=os.setsid`. This placed the `counter.py` script into a *second*, entirely new process group (**PGID_B**). The test's cleanup logic would correctly send a kill signal to **PGID_A**, terminating the `persistproc run` process. However, the `counter.py` script in **PGID_B** would not receive the signal, becoming an orphaned process and causing the test to hang. ## 3. External Research & Validation Our investigation was guided by several external resources: * **Python Bug Tracker - [Issue 38502: regrtest: use process groups](https://bugs.python.org/issue38502):** This issue in the CPython test runner (`regrtest`) confirmed that our test-side strategy of using `start_new_session=True` and killing the entire process group is a valid and recommended approach for managing subprocess trees during tests. * **Pytest GitHub - [Issue #11174: Test using multiprocessing only hangs when ran with test suite](https://github.com/pytest-dev/pytest/issues/11174):** This issue highlighted the complexities of subprocess management within a test runner on POSIX systems. It explained how the default `fork` start method can lead to hangs due to inherited state (like file descriptors) from the parent `pytest` process. This reinforced our understanding that clean process separation is critical. ## 4. The Implemented Fix Based on the "double `setsid`" diagnosis, the fix was to remove the redundant session creation from the application code. Since the `persistproc` server is already started in a managed session by our test fixture, any children it spawns should inherit its process group. The following change was made in `persistproc/process_manager.py`: ```python # ... try: proc = subprocess.Popen( # noqa: S603 – user command shlex.split(command), cwd=str(working_directory) if working_directory else None, env={**os.environ, **(environment or {})}, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True, # The line below was removed. # preexec_fn=os.setsid if os.name != "nt" else None, ) # ... ``` ## 5. A Failed Experiment and Reversal Based on the "double `setsid`" diagnosis, the initial fix attempted was to remove the `preexec_fn=os.setsid` call from `ProcessManager.start_process`. This proved to be incorrect. It caused a major regression, making the previously passing tests (`test_process_lifecycle` and `test_process_restart`) fail with `httpx.RemoteProtocolError: peer closed connection`. This indicated that the server was being terminated when a test client asked it to stop a child process. The conclusion is that the `ProcessManager` **must** spawn children in their own process session to isolate them from the server. The `preexec_fn=os.setsid` call was restored. ## 6. Correcting Test Logic Upon further review, the `test_run_detach_keeps_process_running` test was found to have flawed logic. It was starting a process, then starting a *second* process with `run --on-exit detach`, and then checking the status of the *first* process. The test was rewritten to correctly verify the detach behavior: 1. Start a new process using `run --on-exit detach`. 2. Find the PID of the new process from the server's process list. 3. Terminate the `run` command itself. 4. Verify that the new process is still running. 5. Clean up the orphaned process. ## 7. Current Status (Ongoing) After restoring the `preexec_fn` call and correcting the test logic, we are back to the original state: - `test_server_runs_and_responds` passes. - `test_process_lifecycle` passes. - `test_process_restart` passes. - `test_run_kills_process_on_exit` **fails with a timeout.** - `test_run_detach_keeps_process_running` **fails with a timeout.** The root cause of the timeout in the two `run` command tests is still unknown. The investigation is ongoing.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/irskep/persistproc'

If you have feedback or need assistance with the MCP directory API, please join our Discord server