Avoiding Random Ansible Runner Hangs in Async Multithreaded Python

I recently hit an annoying failure mode while running Ansible Runner from a Python program that uses both asyncio and worker threads. Most runs completed normally. Every now and then, a runner call just stopped making progress. No obvious exception. No clear reproduction. Just one of those “why is this process still here?” moments.

The clue was an old ptyprocess issue: PtyProcess.spawn is not safe for use in multithreaded applications. The maintainer note there describes the likely symptom as random-seeming deadlocks and hangs, which matched the behaviour almost too well.

The shape of the problem

Ansible Runner uses pexpect to start and supervise its child process. That is convenient because pexpect can watch output, answer prompts, and keep the runner loop simple.

The uncomfortable part is the default PTY-backed spawn path. Under the hood, it relies on a fork()-to-exec() sequence with Python-level work in the middle. In a single-threaded program, this is usually boring enough. In a multithreaded program, it becomes much less boring.

After fork(), only the calling thread exists in the child process. Locks that other threads held at that instant are copied into the child still locked, but the threads that could release them no longer exist. If the child then runs code that touches one of those locks before exec(), it can freeze in a place that looks unrelated to your application logic.

That is why this class of bug feels random. The failure depends on timing: which thread held what, exactly when the runner spawned a child, and what the child touched before exec().
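To make the mechanism concrete, here is a minimal sketch that is not taken from ptyprocess or Ansible Runner. A helper thread holds a plain threading.Lock while the main thread forks; the child uses a non-blocking acquire so the example reports the stuck lock instead of hanging. The function name child_can_take_lock is made up for illustration, and the sketch assumes a POSIX system and Python 3.9 or newer.

```python
import os
import threading


def child_can_take_lock() -> bool:
    """Fork while another thread holds a lock; report whether the child could acquire it."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder() -> None:
        # Keep the lock held until the parent has finished forking.
        with lock:
            holding.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    holding.wait()  # the lock is now definitely held by the helper thread

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nobody can release the lock.
        # A blocking acquire would hang forever; a non-blocking one just fails.
        acquired = lock.acquire(blocking=False)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's verdict, then let the helper thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.waitstatus_to_exitcode(status) == 0
```

Replace the non-blocking acquire with a plain lock.acquire() and the child never exits, which is the hang in miniature. Recent Python versions also emit a DeprecationWarning when os.fork() is called from a multithreaded process, for this same reason.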

Why `asyncio` made it easier to notice

asyncio itself was not the bug. The event loop was just the place where the symptom became visible.

In my case, runner work was launched from async workflows and delegated into threads so that blocking Ansible execution would not stop the event loop. That means the process had at least three moving parts:

an event loop scheduling higher-level tasks;
worker threads calling into Ansible Runner;
pexpect spawning and supervising child processes.

This is exactly the kind of runtime where a “mostly fine” fork-safety issue can turn into a real production headache.
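A rough sketch of that layout, under stated assumptions: run_blocking_job is a hypothetical stand-in for a blocking runner call (it just starts a trivial child process), and asyncio.to_thread is one common way to hand such a call to a worker thread. My real code is more involved, but the shape is the same: the event loop schedules, threads block, and each thread spawns a child.

```python
import asyncio
import subprocess
import sys
import threading


def run_blocking_job(name: str) -> tuple[str, int, bool]:
    """Hypothetical stand-in for a blocking runner call that spawns a child process."""
    completed = subprocess.run([sys.executable, "-c", "pass"], check=False)
    on_main_thread = threading.current_thread() is threading.main_thread()
    return name, completed.returncode, on_main_thread


async def main() -> list[tuple[str, int, bool]]:
    # Each job runs in a worker thread, so the event loop stays responsive
    # while several threads spawn children at roughly the same time.
    jobs = (asyncio.to_thread(run_blocking_job, f"job-{i}") for i in range(3))
    return await asyncio.gather(*jobs)
```

Running asyncio.run(main()) returns one tuple per job, and the third field shows that none of the spawns happened on the main thread.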

The workaround I chose

The practical workaround was not to make the PTY path safer. I replaced it at the integration boundary.

pexpect also has popen_spawn.PopenSpawn, which uses subprocess.Popen instead of a PTY-backed spawn. For my use case, I did not need a real terminal. I needed a child process, stdout/stderr handling, pattern matching, timeouts, and predictable termination. A subprocess-backed spawn was a better fit.
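For readers who have not used it, this is roughly what PopenSpawn looks like on its own, outside Ansible Runner. It is a minimal sketch: the child command and the function name run_and_match are invented for the example, and it assumes pexpect is installed.

```python
import sys

import pexpect
from pexpect.popen_spawn import PopenSpawn


def run_and_match() -> tuple[str, int]:
    """Spawn a child without a PTY, match a pattern in its output, and wait for it."""
    child = PopenSpawn(
        [sys.executable, "-c", "print('status: ready')"],
        encoding="utf-8",  # work with str instead of bytes
        timeout=10,
    )
    child.expect(r"status: (\w+)")  # same pattern-matching API as the PTY-backed spawn
    word = child.match.group(1)
    child.expect(pexpect.EOF)  # drain the remaining output until the pipe closes
    return word, child.wait()  # wait() returns the subprocess return code
```

The expect() interface is the same as with the PTY-backed spawn; what changes is how the child is created and which lifecycle methods exist, which is why the adapter further down has to fill in a few gaps.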

The patch had four parts:

Create a small spawn adapter on top of PopenSpawn.
Start each runner child in its own process group and make termination signal the whole group.
Preserve the lifecycle behaviour Ansible Runner expects, including timeout handling, process status, and resource cleanup.
Monkey patch the runner’s pexpect.spawn reference early enough that runner instances use the adapter.

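The adapter is not just a constructor wrapper. It also carries the close, timeout, liveness, process-status, pipe-cleanup, and process-group termination semantics needed by the runner loop. This exact patch is for Linux and Unix-like systems: it relies on POSIX process groups through os.setsid, os.killpg, and preexec_fn. One caveat: the subprocess documentation warns that preexec_fn is itself not safe in the presence of threads, because the callable runs in the child between fork() and exec(). os.setsid does very little there, so the exposure is far smaller than on the PTY path, but it is not zero. Stripped of project-specific logging, the relevant code is: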

import os
import signal
import time
from collections.abc import Awaitable
from typing import Any

from pexpect.exceptions import ExceptionPexpect, TIMEOUT
from pexpect.popen_spawn import PopenSpawn


class RunnerSpawn(PopenSpawn):
    """An Ansible runner spawn."""

    def __init__(self, cmd: str, args: list[str] | None = None, **kwargs: Any) -> None:
        """Initialise an Ansible runner spawn.

        Parameters
        ----------
        cmd : str
            A command.
        args : list[str] | None, default: `None`
            Command arguments (default: `None`). Do not provide these if they have already been included in ``cmd``.
        **kwargs : dict[str, typing.Any], optional
            Keyword arguments.

        Notes
        -----
        ``preexec_fn`` defaults to `os.setsid` so that the subprocess and all its descendants share a process group.
        This ensures `close()` and `kill()` can terminate the entire process tree, preventing orphaned child processes
        when ansible-runner spawns nested subprocesses.
        """
        cmd: list[str] = [cmd]
        if args:
            cmd.extend(args)

        kwargs.setdefault("preexec_fn", os.setsid)
        try:
            super().__init__(
                cmd=cmd,
                **{
                    key: value
                    for key, value in kwargs.items()
                    if key
                    in (
                        "codec_errors",
                        "cwd",
                        "encoding",
                        "env",
                        "logfile",
                        "maxread",
                        "preexec_fn",
                        "searchwindowsize",
                        "timeout",
                    )
                },
            )
        except Exception as exception:
            raise ExceptionPexpect(f"Failed to initialise an Ansible runner spawn with the cmd: {cmd}") from exception

    def close(self, force=True) -> None:
        """Close an Ansible runner spawn.

        Parameters
        ----------
        force : bool, default: `True`
            A flag denoting whether to force closing an Ansible runner spawn after failed graceful termination (`True`,
            default) or not (`False`).

        Notes
        -----
        `ptyprocess.PtyProcess.terminate()` is a reference.
        """
        if self.closed:
            return

        try:
            if not self.isalive():
                return

            self.kill(sig=signal.SIGHUP)
            time.sleep(0.1)
            if not self.isalive():
                return

            self.kill(sig=signal.SIGCONT)
            time.sleep(0.1)
            if not self.isalive():
                return

            self.kill(sig=signal.SIGINT)
            if not force:
                return

            time.sleep(0.1)
            if not self.isalive():
                return

            self.kill(sig=signal.SIGKILL)
        except OSError:
            pass
        finally:
            try:
                if not self.isalive():
                    proc = getattr(self, "proc", None)
                    if proc is not None:
                        for pipe_name in ("stdin", "stdout"):
                            pipe: Any = getattr(proc, pipe_name, None)
                            if pipe is None or pipe.closed:
                                continue

                            try:
                                pipe.close()
                            except OSError:
                                pass
            finally:
                self.closed = True

    def expect(self, *args, **kwargs) -> Awaitable[int] | int | None:
        try:
            return super().expect(*args, **kwargs)
        except TIMEOUT:
            # `ansible_runner.runner.Runner.run()` contains a fix to expect an EOF in at most 5 seconds. After monkey
            # patching `ansible_runner.runner.pexpect.spawn`, a timeout exception causes unexpected resource leaks.
            # Reference: https://github.com/ansible/ansible-runner/issues/1330
            return None

    def isalive(self) -> bool:
        """Check if an Ansible runner spawn's process is alive.

        Returns
        -------
        bool
            A flag denoting whether an Ansible runner spawn's process is alive (`True`) or not (`False`).
        """
        if not (alive := self.proc.poll() is None):
            if self.proc.returncode >= 0:
                self.exitstatus = self.proc.returncode
                self.signalstatus = None
            else:
                self.exitstatus = None
                self.signalstatus = -self.proc.returncode

        return alive

    def kill(self, sig) -> None:
        """Send a signal to the process group of an Ansible runner spawn.

        Parameters
        ----------
        sig : int
            A signal.
        """
        os.killpg(os.getpgid(self.proc.pid), sig)

Then I patch the runner spawn entry point before creating runner objects:

import ansible_runner.runner

ansible_runner.runner.pexpect.spawn = RunnerSpawn
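One side effect is worth spelling out. Assuming ansible_runner.runner does a plain import pexpect, the name ansible_runner.runner.pexpect is the pexpect module object itself, so the assignment replaces pexpect.spawn for every importer in the process, not only for the runner. The sketch below shows the effect with stand-in modules; fake_pexpect, fake_runner, and adapter are hypothetical names, not real packages.

```python
import types


def patch_is_process_wide() -> bool:
    """Show that patching an attribute through one module reference changes it for all."""
    # Stand-in for the pexpect module, with its original spawn callable.
    fake_pexpect = types.ModuleType("fake_pexpect")
    fake_pexpect.spawn = lambda cmd: f"pty:{cmd}"

    # Stand-in for the runner module; this mirrors `import pexpect` inside it.
    fake_runner = types.ModuleType("fake_runner")
    fake_runner.pexpect = fake_pexpect

    def adapter(cmd: str) -> str:
        return f"popen:{cmd}"

    # Same shape as the monkey patch: reach through the runner module and rebind spawn.
    fake_runner.pexpect.spawn = adapter

    # Anyone else holding the module now sees the adapter too.
    return fake_pexpect.spawn is adapter and fake_pexpect.spawn("ls") == "popen:ls"
```

If other code in the same process relies on the PTY-backed pexpect.spawn, it will get the adapter as well, so this patch is best applied in processes where Ansible Runner is the only pexpect user.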

The process-group detail matters because Ansible Runner may spawn nested subprocesses. Killing only the direct child can leave descendants behind. By setting preexec_fn to os.setsid, the adapter creates a fresh process group for the runner tree; by overriding kill() to call os.killpg(os.getpgid(self.proc.pid), sig), the close sequence signals the whole group.
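The same idea can be shown with the standard library alone. This is a simplified sketch, not the adapter: spawn_tree and terminate_group are hypothetical helper names, the signal ladder is reduced to SIGTERM followed by SIGKILL, and it uses subprocess.Popen with start_new_session=True, which asks subprocess to call setsid() in the child without running a Python callable there.

```python
import os
import signal
import subprocess


def spawn_tree() -> subprocess.Popen:
    """Start a shell that leaves a background child behind, in its own session and group."""
    return subprocess.Popen(["sh", "-c", "sleep 30 & sleep 30"], start_new_session=True)


def terminate_group(proc: subprocess.Popen, grace: float = 2.0) -> int:
    """Signal the whole process group, escalating to SIGKILL after a grace period."""
    pgid = os.getpgid(proc.pid)  # equals proc.pid because the child leads its own group
    os.killpg(pgid, signal.SIGTERM)  # reaches the shell and the background sleep
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

Calling terminate_group(spawn_tree()) returns a negative return code, because the direct child was ended by a signal. Sending the signal with os.kill(proc.pid, ...) instead would leave the background sleep running.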

Process termination is only part of cleanup. PopenSpawn owns the subprocess’s stdin and stdout pipes and uses a reader thread to consume output. Once the process is no longer alive, close() closes both pipes, allowing those resources and the reader to finish cleanly. It skips cleanup immediately when the spawn is already closed, only closes pipes after the child has exited, tolerates individual pipe-close failures, and still marks the spawn as closed in every path.

The liveness check also translates subprocess return codes into the status fields expected by pexpect: a non-negative return code becomes exitstatus, while a negative return code represents termination by a signal and becomes signalstatus. Keeping those two cases separate prevents a signal number from being mistaken for an ordinary process exit code.

The timeout handling is also part of the adapter on purpose. Ansible Runner has had fixes around draining final output until EOF so that events are not lost near process shutdown; see ansible-runner issue #1330 for the general shape of that race. After replacing the spawn backend, letting a TIMEOUT escape from expect() could short-circuit cleanup and leak resources, so the adapter catches it and returns None instead.

That is the main lesson here: the monkey patch is small, but the compatibility surface is not only spawn(cmd). It includes liveness, exit and signal status, timeout behaviour, pipe cleanup, idempotent closing, and process-tree termination. The replacement should look boringly compatible to the runner loop.

Things I would check before copying this pattern

This workaround is not a universal “replace pexpect with subprocess” rule.

It is a good fit when:

your runner commands do not genuinely require PTY behaviour;
you are embedding runner calls inside a multithreaded Python process;
your runtime is Linux or another Unix-like system with POSIX process-group semantics;
your hangs line up with child-process spawning rather than Ansible task logic;
you can patch the spawn path before any runner objects are created.

It is a poor fit when:

the child process needs terminal-specific behaviour;
your automation depends on exact PTY semantics;
you need the same code path to work on Windows;
you cannot control import order;
you have not tested cancellation, timeout, and signal handling.

Takeaway

When a Python program mixes asyncio, threads, and child-process orchestration, fork() details stop being trivia. A hang may not come from the async task you are staring at. It may come from a child process being born in a runtime where another thread held the wrong lock at the wrong time.

For my case, moving Ansible Runner away from the PTY-backed pexpect spawn path made the runner lifecycle much more predictable. The patch is intentionally unglamorous: keep Runner’s expectations intact, use a subprocess-backed spawn where a PTY is unnecessary, terminate the whole process group, and release the subprocess resources once it exits.

That is often the best kind of production fix. Not clever. Just less surprising.
