Auto-Retry

Subagent runs are automatically retried on transient networking failures. Model gateways and proxies (e.g. LiteLLM) occasionally return 502/503/504, hit a 429 rate limit, or drop a connection. Rather than failing the whole delegation, the toolset retries the subagent with exponential backoff.

Each retry resumes with the full message history accumulated during the failed attempt, so partial progress (completed model turns and tool calls) is preserved: the subagent continues where it left off instead of restarting from scratch.

Defaults

Retrying is on by default: a subagent gets up to 3 extra attempts after the first failure, with exponential backoff and jitter. To opt out, set max_retries=0, which restores the legacy agent.run() path.

Per-Subagent Configuration

Retry behaviour is configured through retry_* fields on SubAgentConfig. Any field you omit falls back to the default policy.

```python
from subagents_pydantic_ai import SubAgentConfig

SubAgentConfig(
    name="researcher",
    description="Researches topics",
    instructions="You are a research assistant.",
    max_retries=5,                 # extra attempts after the first failure (default 3)
    retry_initial_delay=1.0,       # seconds before the first retry (default 1.0)
    retry_max_delay=30.0,          # cap on the backoff delay (default 30.0)
    retry_backoff_multiplier=2.0,  # delay growth factor per attempt (default 2.0)
    retry_jitter=True,             # randomise delay in [0, delay] (default True)
)
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_retries` | `int` | `3` | Extra attempts after the first failure. `0` disables retrying |
| `retry_initial_delay` | `float` | `1.0` | Seconds to wait before the first retry |
| `retry_max_delay` | `float` | `30.0` | Upper bound on the backoff delay |
| `retry_backoff_multiplier` | `float` | `2.0` | Delay multiplier applied each attempt |
| `retry_jitter` | `bool` | `True` | Randomise the delay in `[0, delay]` (full jitter) to avoid a thundering herd across concurrent subagents |
| `retry_on` | `Callable[[BaseException], bool]` | built-in classifier | Custom predicate deciding whether an exception is transient |

These fields are resolved into a RetryConfig via RetryConfig.from_config(config).

What Counts as Transient

By default, is_transient_error decides whether a failure is worth retrying:

- A ModelHTTPError with status 408, 409, 425, 429, 500, 502, 503, 504, or 529: gateway hiccups, rate limits, or upstream overload.
- A bare ModelAPIError (no HTTP status): connection resets, read timeouts, and other transport-level failures from the model client.

Everything else is treated as non-transient and is not retried. This includes authentication failures and any other 4xx status not listed above, UnexpectedModelBehavior, UsageLimitExceeded, UserError, validation errors, and task cancellation (asyncio.CancelledError).
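The classification rules above can be sketched as a small predicate. This is an illustration only, not the library's is_transient_error: ApiErrorStandIn and HttpErrorStandIn are hypothetical stand-ins for ModelAPIError and ModelHTTPError so the snippet runs without any dependencies. The subclass relationship between the two is an assumption made for the example.

```python
import asyncio

# Status codes listed above as retryable.
TRANSIENT_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504, 529})


class ApiErrorStandIn(Exception):
    """Hypothetical stand-in for a transport-level model client error."""


class HttpErrorStandIn(ApiErrorStandIn):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_transient_sketch(exc: BaseException) -> bool:
    """Return True only for failures the rules above describe as transient."""
    if isinstance(exc, asyncio.CancelledError):
        return False  # cancellation must propagate, never be retried
    if isinstance(exc, HttpErrorStandIn):
        return exc.status_code in TRANSIENT_STATUS_CODES
    # A bare API error has no HTTP status: treat it as a transport failure.
    return isinstance(exc, ApiErrorStandIn)


print(is_transient_sketch(HttpErrorStandIn(503)))  # True
print(is_transient_sketch(HttpErrorStandIn(401)))  # False
print(is_transient_sketch(ApiErrorStandIn("connection reset")))  # True
print(is_transient_sketch(ValueError("bad input")))  # False
```

Checking the status-bearing error before the bare one matters in this sketch: otherwise a 401 would be retried just because it subclasses the transport error.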

Custom Classification

Provide your own predicate via retry_on to override the default classifier:

```python
from pydantic_ai.exceptions import ModelHTTPError
from subagents_pydantic_ai import SubAgentConfig


def only_rate_limits(exc: BaseException) -> bool:
    # Retry only when the gateway reports a 429 rate limit.
    return isinstance(exc, ModelHTTPError) and exc.status_code == 429


SubAgentConfig(
    name="researcher",
    description="Researches topics",
    instructions="You are a research assistant.",
    retry_on=only_rate_limits,
)
```

Backoff Delay

compute_backoff_delay computes the wait before each retry, where attempt is the 1-based retry number:

```text
base  = initial_delay * (backoff_multiplier ** (attempt - 1))
delay = min(base, max_delay)
# with jitter: delay = random.uniform(0.0, delay)  (full jitter)
```

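With the defaults (3 retries) the delays before jitter are 1s, 2s, and 4s. The 30s cap only takes effect when max_retries is raised to 6 or more, since the sixth delay would otherwise be 32s. When jitter is enabled, each delay is drawn uniformly from [0, delay].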
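The formula above translates directly into Python. The function below is a minimal sketch under the documented defaults, not the library's compute_backoff_delay. The name backoff_delay_sketch and its signature are hypothetical.

```python
import random


def backoff_delay_sketch(
    attempt: int,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    backoff_multiplier: float = 2.0,
    jitter: bool = True,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based and must be >= 1")
    base = initial_delay * (backoff_multiplier ** (attempt - 1))
    delay = min(base, max_delay)
    if jitter:
        # Full jitter: pick uniformly between 0 and the capped delay.
        delay = random.uniform(0.0, delay)
    return delay


# Deterministic schedule with jitter disabled.
schedule = [backoff_delay_sketch(n, jitter=False) for n in range(1, 7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Disabling jitter makes the schedule reproducible, which is convenient in tests. With jitter on, the value for a given attempt varies between 0 and the capped delay.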

Observing Retries

While a task is waiting between attempts, its status is TaskStatus.RETRYING. The number of retries performed for a task is tracked on TaskHandle.retry_count.

```python
# task_manager and task_id come from your existing delegation code.
handle = task_manager.get_handle(task_id)
if handle.status == TaskStatus.RETRYING:
    print(f"Retrying (attempt {handle.retry_count})")
```

Under the Hood

run_with_retry drives the retry loop. When max_retries > 0, it runs the agent via Agent.iter() so that, on a transient failure, the history accumulated during the failed attempt is passed as message_history to the next attempt. When max_retries <= 0, it falls through to a plain agent.run(), which is the unchanged legacy path. Event streaming (the event_stream_handler and wrap_run_event_stream capabilities) keeps working across retries, and cooperative (soft) cancellation is honoured at node boundaries on the retry-driven path.
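The resume-from-history idea can be illustrated with a toy loop. Everything in this sketch is hypothetical: FlakyAgentStandIn, TransientErrorStandIn, and run_with_retry_sketch are stand-ins written for this example. It does not call Agent.iter() and does not reflect the real run_with_retry signature. History is a plain list of strings here rather than model messages.

```python
import asyncio


class TransientErrorStandIn(Exception):
    """Hypothetical transient failure (e.g. a 503 from a gateway)."""


class FlakyAgentStandIn:
    """Hypothetical agent: each step appends to the history; step 2 fails once."""

    def __init__(self) -> None:
        self.failed_once = False

    async def run_steps(self, history: list[str]) -> str:
        # Continue from wherever the supplied history left off.
        for step in range(len(history) + 1, 4):
            if step == 2 and not self.failed_once:
                self.failed_once = True
                raise TransientErrorStandIn("gateway returned 503")
            history.append(f"step-{step}")
        return "done"


async def run_with_retry_sketch(agent, max_retries: int = 3, delay: float = 0.0):
    history: list[str] = []  # survives across attempts
    retry_count = 0
    while True:
        try:
            result = await agent.run_steps(history)
            return result, history, retry_count
        except TransientErrorStandIn:
            if retry_count >= max_retries:
                raise
            retry_count += 1
            await asyncio.sleep(delay)  # a real loop would back off here


result, history, retries = asyncio.run(run_with_retry_sketch(FlakyAgentStandIn()))
print(result, history, retries)  # done ['step-1', 'step-2', 'step-3'] 1
```

Step 1 runs only once even though the first attempt failed at step 2, which is the point of carrying the history into the next attempt.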

See the Prompts & Retry API for full signatures.

Next Steps

Execution Modes - Sync vs async delegation
Cancellation - Stopping running tasks