Auto-Retry¶
Subagent runs are automatically retried on transient networking failures. Model
gateways and proxies (e.g. LiteLLM) occasionally return 502/503/504, hit a
429 rate limit, or drop a connection. Rather than failing the whole delegation,
the toolset retries the subagent with exponential backoff.
Crucially, each retry resumes with the full accumulated message history from the failed attempt, so partial progress (completed model turns and tool calls) is not thrown away — the subagent continues instead of restarting from scratch.
Defaults¶
Retrying is on by default: a subagent gets 3 extra attempts after the first
failure, with exponential backoff and jitter. You opt out by setting
max_retries=0, which restores the legacy agent.run() path.
Per-Subagent Configuration¶
Retry behaviour is configured through retry_* fields on
SubAgentConfig. Any field you omit
falls back to the default policy.
from subagents_pydantic_ai import SubAgentConfig

SubAgentConfig(
    name="researcher",
    description="Researches topics",
    instructions="You are a research assistant.",
    max_retries=5,                 # extra attempts after the first failure (default 3)
    retry_initial_delay=1.0,       # seconds before the first retry (default 1.0)
    retry_max_delay=30.0,          # cap on the backoff delay (default 30.0)
    retry_backoff_multiplier=2.0,  # delay growth factor per attempt (default 2.0)
    retry_jitter=True,             # randomise delay in [0, delay] (default True)
)
| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | int | 3 | Extra attempts after the first failure. 0 disables retrying |
| retry_initial_delay | float | 1.0 | Seconds to wait before the first retry |
| retry_max_delay | float | 30.0 | Upper bound on the backoff delay |
| retry_backoff_multiplier | float | 2.0 | Delay multiplier applied each attempt |
| retry_jitter | bool | True | Randomise the delay in [0, delay] (full jitter) to avoid a thundering herd across concurrent subagents |
| retry_on | Callable[[BaseException], bool] | built-in classifier | Custom predicate deciding whether an exception is transient |
These fields are resolved into a
RetryConfig via
RetryConfig.from_config(config).
What Counts as Transient¶
By default, is_transient_error
decides whether a failure is worth retrying:
- A ModelHTTPError with status 408, 409, 425, 429, 500, 502, 503, 504, or 529: gateway hiccups, rate limits, or upstream overload.
- A bare ModelAPIError (no HTTP status): connection resets, read timeouts, and other transport-level failures from the model client.
Everything else is treated as non-transient and is not retried: authentication
and other 4xx errors outside the status list above, UnexpectedModelBehavior,
UsageLimitExceeded, UserError, validation errors, and task cancellation
(asyncio.CancelledError).
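The classification rules above can be sketched as a small predicate. This is only an illustration, not the library's is_transient_error: StandInAPIError, StandInHTTPError, and looks_transient are hypothetical names, and the class hierarchy (the HTTP error subclassing the API error) is an assumption made so the example runs without any dependencies.

```python
import asyncio

# Status codes the text lists as retryable.
RETRYABLE_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504, 529})


class StandInAPIError(Exception):
    """Hypothetical stand-in for a transport-level model error (no HTTP status)."""


class StandInHTTPError(StandInAPIError):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def looks_transient(exc: BaseException) -> bool:
    """Return True only for failures the rules above describe as retryable."""
    if isinstance(exc, StandInHTTPError):
        # HTTP errors are retried only for the listed status codes.
        return exc.status_code in RETRYABLE_STATUS_CODES
    # A bare API error (no status) is treated as a transport failure.
    return isinstance(exc, StandInAPIError)


for candidate in (
    StandInHTTPError(503),
    StandInHTTPError(401),
    StandInAPIError("connection reset"),
    asyncio.CancelledError(),
):
    print(type(candidate).__name__, looks_transient(candidate))
```

Checking the HTTP case first matters in this sketch: because the HTTP stand-in subclasses the bare error, reversing the order would mark every HTTP error, including a 401, as transient.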
Custom Classification¶
Provide your own predicate via retry_on to override the default classifier:
from pydantic_ai.exceptions import ModelHTTPError
from subagents_pydantic_ai import SubAgentConfig


def only_rate_limits(exc: BaseException) -> bool:
    # Retry only when the gateway reports a 429 rate limit.
    return isinstance(exc, ModelHTTPError) and exc.status_code == 429


SubAgentConfig(
    name="researcher",
    description="Researches topics",
    instructions="You are a research assistant.",
    retry_on=only_rate_limits,
)
Backoff Delay¶
compute_backoff_delay
computes the wait before each retry (1-based attempt):
base = initial_delay * (backoff_multiplier ** (attempt - 1))
delay = min(base, max_delay)
# with jitter: delay = random.uniform(0.0, delay) (full jitter)
With the defaults this yields 1s, 2s, 4s, 8s, 16s, and then 30s (the cap) for
any later attempt. When jitter is enabled, each wait is drawn uniformly from
[0, delay].
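As a rough, self-contained sketch of the formula above (not the library's compute_backoff_delay, whose exact signature is not shown here), the following function reproduces the schedule. The name backoff_delay_sketch and its parameter names are assumptions chosen to mirror the retry_* fields.

```python
import random


def backoff_delay_sketch(
    attempt: int,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    backoff_multiplier: float = 2.0,
    jitter: bool = True,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    # Exponential growth, then clamp to the configured ceiling.
    base = initial_delay * (backoff_multiplier ** (attempt - 1))
    delay = min(base, max_delay)
    if jitter:
        # Full jitter: pick uniformly between 0 and the capped delay.
        delay = random.uniform(0.0, delay)
    return delay


# Deterministic schedule (jitter off) for the first six retries.
schedule = [backoff_delay_sketch(n, jitter=False) for n in range(1, 7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With the default of 3 retries and jitter disabled, the waits add up to 1 + 2 + 4 = 7 seconds, not counting the time spent in the attempts themselves. Full jitter can only shorten each wait, so 7 seconds is also the upper bound when jitter is on.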
Observing Retries¶
While a task is waiting between attempts, its status is
TaskStatus.RETRYING. The number of
retries performed for a task is tracked on
TaskHandle.retry_count.
handle = task_manager.get_handle(task_id)
if handle.status == TaskStatus.RETRYING:
    print(f"Retrying (attempt {handle.retry_count})")
Under the Hood¶
run_with_retry drives the retry
loop. When max_retries > 0, it runs the agent via Agent.iter() so that, on a
transient failure, the accumulated history from the failed attempt is replayed via
message_history on the next attempt. When max_retries <= 0, it falls through to
a plain agent.run() — the legacy path, unchanged. Event streaming
(event_stream_handler and wrap_run_event_stream capabilities) keeps working
across retries, and cooperative (soft) cancellation is honoured at node boundaries
on the retry-driven path.
See the Prompts & Retry API for full signatures.
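The resume-from-history idea can be illustrated with a toy loop. This is a simplified, synchronous sketch and not the library's run_with_retry: the real loop is described above as using Agent.iter() and message_history, whereas TransientError, run_with_retry_sketch, and flaky_run below are hypothetical names, and the "history" is just a list of step labels.

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical transient failure that carries the history built so far."""

    def __init__(self, history: List[str]) -> None:
        super().__init__("transient failure")
        self.history = history


def run_with_retry_sketch(
    run_once: Callable[[List[str]], List[str]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call run_once, resuming from accumulated history after transient errors."""
    history: List[str] = []
    attempts = max(max_retries, 0) + 1  # the first attempt plus the retries
    for attempt in range(attempts):
        try:
            return run_once(history)
        except TransientError as exc:
            if attempt == attempts - 1:
                raise  # retries exhausted, surface the failure
            history = exc.history  # keep partial progress for the next attempt
            sleep(min(1.0 * 2.0 ** attempt, 30.0))  # backoff, jitter omitted
    raise AssertionError("unreachable")


STEPS = ["plan", "search", "summarise"]
fail_once = {"search", "summarise"}  # each of these steps fails one time
starts: List[int] = []  # how much history each attempt started with
waits: List[float] = []  # delays requested between attempts


def flaky_run(history: List[str]) -> List[str]:
    """Toy subagent: continues after the steps already present in history."""
    starts.append(len(history))
    history = list(history)
    for step in STEPS[len(history):]:
        if step in fail_once:
            fail_once.discard(step)
            raise TransientError(history)
        history.append(step)
    return history


# Record the waits instead of sleeping so the example finishes instantly.
result = run_with_retry_sketch(flaky_run, sleep=waits.append)
print(result, starts, waits)
```

Each attempt starts with more history than the previous one (0, then 1, then 2 completed steps), which is the property the retry-driven path is meant to preserve: completed work is carried forward instead of being redone.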
Next Steps¶
- Execution Modes - Sync vs async delegation
- Cancellation - Stopping running tasks