persistproc

Overview Schema Related Servers Score Discussions

persistproc
internal_docs

process_testing_issues.md

process_testing_issues.md•6.15 KiB

# Debugging Process Termination Failures in E2E Tests This document outlines the investigation into a persistent issue where `pytest` end-to-end tests for the `persistproc run` command would hang, eventually failing with a `TimeoutExpired` error. The core of the problem was that child processes spawned during the tests were not being correctly terminated during test cleanup. ## 1. Initial Problem & Test-Side Fixes The initial problem was observed in two new tests, `test_run_kills_process_on_exit` and `test_run_detach_keeps_process_running`, which use `persistproc run` to manage a long-running script (`tests/scripts/counter.py`). The tests would consistently time out because the `counter.py` process was not being terminated when the test finished. Our first step was to improve the test suite's process management capabilities: 1. **Process Groups (`start_new_session`):** We modified our test helpers (`tests/helpers.py`) to launch both the main `persistproc` server and the `persistproc run` command using `subprocess.Popen` with the `start_new_session=True` argument. This ensures that each subprocess is the leader of a new process group. 2. **Robust Cleanup with `os.killpg`:** We created a new helper function, `stop_run`, which sends signals to the entire process group using `os.killpg(os.getpgid(proc.pid), signal.SIGINT)`. This function implements a robust, staged termination strategy: * It first sends `SIGINT` (graceful shutdown). * It waits and polls for the process to exit. * If the process is still alive, it sends `SIGTERM`. * As a last resort, it sends `SIGKILL`. This approach was also integrated into the `persistproc_server` fixture in `tests/conftest.py` to ensure reliable server cleanup. ## 2. Identifying the Root Cause: The "Double `setsid`" Problem Despite the robust test-side cleanup, the tests continued to hang. Further investigation revealed a "double `setsid`" issue in the application logic itself. The sequence of events was as follows: 1. The test suite starts `persistproc run` in a new process group (**PGID_A**). 2. `persistproc run` connects to the `persistproc` server and requests that it start the `counter.py` script. 3. The server's `ProcessManager` in `persistproc/process_manager.py` would then start `counter.py` using `subprocess.Popen` with `preexec_fn=os.setsid`. This placed the `counter.py` script into a *second*, entirely new process group (**PGID_B**). The test's cleanup logic would correctly send a kill signal to **PGID_A**, terminating the `persistproc run` process. However, the `counter.py` script in **PGID_B** would not receive the signal, becoming an orphaned process and causing the test to hang. ## 3. External Research & Validation Our investigation was guided by several external resources: * **Python Bug Tracker - [Issue 38502: regrtest: use process groups](https://bugs.python.org/issue38502):** This issue in the CPython test runner (`regrtest`) confirmed that our test-side strategy of using `start_new_session=True` and killing the entire process group is a valid and recommended approach for managing subprocess trees during tests. * **Pytest GitHub - [Issue #11174: Test using multiprocessing only hangs when ran with test suite](https://github.com/pytest-dev/pytest/issues/11174):** This issue highlighted the complexities of subprocess management within a test runner on POSIX systems. It explained how the default `fork` start method can lead to hangs due to inherited state (like file descriptors) from the parent `pytest` process. This reinforced our understanding that clean process separation is critical. ## 4. The Implemented Fix Based on the "double `setsid`" diagnosis, the fix was to remove the redundant session creation from the application code. Since the `persistproc` server is already started in a managed session by our test fixture, any children it spawns should inherit its process group. The following change was made in `persistproc/process_manager.py`: ```python # ... try: proc = subprocess.Popen( # noqa: S603 – user command shlex.split(command), cwd=str(working_directory) if working_directory else None, env={**os.environ, **(environment or {})}, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True, # The line below was removed. # preexec_fn=os.setsid if os.name != "nt" else None, ) # ... ``` ## 5. A Failed Experiment and Reversal Based on the "double `setsid`" diagnosis, the initial fix attempted was to remove the `preexec_fn=os.setsid` call from `ProcessManager.start_process`. This proved to be incorrect. It caused a major regression, making the previously passing tests (`test_process_lifecycle` and `test_process_restart`) fail with `httpx.RemoteProtocolError: peer closed connection`. This indicated that the server was being terminated when a test client asked it to stop a child process. The conclusion is that the `ProcessManager` **must** spawn children in their own process session to isolate them from the server. The `preexec_fn=os.setsid` call was restored. ## 6. Correcting Test Logic Upon further review, the `test_run_detach_keeps_process_running` test was found to have flawed logic. It was starting a process, then starting a *second* process with `run --on-exit detach`, and then checking the status of the *first* process. The test was rewritten to correctly verify the detach behavior: 1. Start a new process using `run --on-exit detach`. 2. Find the PID of the new process from the server's process list. 3. Terminate the `run` command itself. 4. Verify that the new process is still running. 5. Clean up the orphaned process. ## 7. Current Status (Ongoing) After restoring the `preexec_fn` call and correcting the test logic, we are back to the original state: - `test_server_runs_and_responds` passes. - `test_process_lifecycle` passes. - `test_process_restart` passes. - `test_run_kills_process_on_exit` **fails with a timeout.** - `test_run_detach_keeps_process_running` **fails with a timeout.** The root cause of the timeout in the two `run` command tests is still unknown. The investigation is ongoing.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/irskep/persistproc'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

process_testing_issues.md•6.15 KiB