Context Manager¶

ContextManagerCapability is the all-in-one capability: live token tracking, automatic compression, tool-output truncation, and an optional agent-triggered compaction tool. This page walks through each feature with runnable snippets.

Live Token Tracking¶

Pass an on_usage_update callback to observe context usage on every model request. It receives the usage fraction, the current token count, and the resolved max budget. The same callback fires again after a compression so you can see usage drop.

Python

from pydantic_ai import Agent
from pydantic_ai_summarization import ContextManagerCapability


def on_usage(pct: float, current: int, max_tokens: int) -> None:
    bar = "#" * int(pct * 20)
    print(f"[{bar:<20}] {pct:.0%}  ({current}/{max_tokens} tokens)")


agent = Agent(
    "openai:gpt-4.1",
    capabilities=[
        ContextManagerCapability(
            max_tokens=100_000,
            on_usage_update=on_usage,
        )
    ],
)

If max_tokens is left as None, the capability auto-detects the model's context window via genai-prices on the first run, falling back to 200_000 when detection is unavailable.
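To make the three callback arguments concrete, here is a standalone sketch of how they relate. It does not import the library: `resolve_budget`, `usage_fraction`, and `render_usage` are hypothetical helper names, and the detected context window is passed in by hand rather than looked up through genai-prices.

```python
from typing import Optional

FALLBACK_MAX_TOKENS = 200_000  # fallback described above when detection is unavailable


def resolve_budget(max_tokens: Optional[int], detected: Optional[int]) -> int:
    """Pick the budget: explicit value, else detected window, else the fallback."""
    if max_tokens is not None:
        return max_tokens
    if detected is not None:
        return detected
    return FALLBACK_MAX_TOKENS


def usage_fraction(current: int, budget: int) -> float:
    """Fraction of the budget in use (the first argument on_usage receives)."""
    return current / budget


def render_usage(pct: float, current: int, max_tokens: int) -> str:
    """Same formatting as on_usage above, returned instead of printed."""
    bar = "#" * int(pct * 20)
    return f"[{bar:<20}] {pct:.0%}  ({current}/{max_tokens} tokens)"


# No explicit budget and no detected window: the fallback applies.
budget = resolve_budget(None, None)
print(render_usage(usage_fraction(50_000, budget), 50_000, budget))
```

Returning the string instead of printing it makes the formatting easy to unit-test before wiring the callback into an agent.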

Automatic Compression¶

Compression fires automatically once usage reaches compress_threshold (default 0.9). Tune the threshold and how much of the tail survives via keep:

Python

cap = ContextManagerCapability(
    max_tokens=100_000,
    compress_threshold=0.85,        # compress at 85% of the budget
    keep=("messages", 6),           # keep the last 6 messages after summarizing
    summarization_model="openai:gpt-4.1-mini",
)

With the default keep=("messages", 0), a compression leaves the generated summary plus the in-flight request — the pending tool returns or the new user prompt the model is about to act on. The summary itself is delivered as a user turn, so the rebuilt history is always something a provider can accept, including on Anthropic and Google where system prompts travel in a separate channel rather than the message list. See Capabilities for the full parameter table and a description of the threshold mechanism.
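The threshold check and the effect of keep can be pictured with a small standalone sketch. This is an illustration under simplifying assumptions, not the library's implementation: messages are plain strings, the in-flight request is not modeled, and `should_compress` and `split_history` are hypothetical names.

```python
from typing import List, Tuple


def should_compress(current: int, max_tokens: int, threshold: float = 0.9) -> bool:
    """True once usage reaches the threshold fraction of the budget."""
    return current / max_tokens >= threshold


def split_history(messages: List[str], keep_last: int) -> Tuple[List[str], List[str]]:
    """Split into (messages to summarize, tail kept verbatim)."""
    if keep_last <= 0:
        return list(messages), []
    cutoff = max(len(messages) - keep_last, 0)
    return messages[:cutoff], messages[cutoff:]


history = [f"msg-{i}" for i in range(10)]

# 86% usage against an 85% threshold triggers compression.
triggered = should_compress(current=86_000, max_tokens=100_000, threshold=0.85)

# keep=("messages", 6) corresponds to keep_last=6 in this sketch.
to_summarize, kept = split_history(history, keep_last=6)
rebuilt = [f"<summary of {len(to_summarize)} messages>"] + kept
print(triggered, rebuilt)
```

With keep_last=0 the whole history is summarized and only the summary placeholder remains, which mirrors the default described above.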

Tool-Output Truncation¶

Large tool results can dominate the context window. Set max_tool_output_tokens to truncate any tool output that exceeds that budget (estimated at roughly 4 characters per token, so the character limit is about max_tool_output_tokens * 4), keeping a head and a tail slice:

Python

cap = ContextManagerCapability(
    max_tokens=100_000,
    max_tool_output_tokens=2_000,   # truncate tool outputs over ~2k tokens
    tool_output_head_lines=10,      # keep first 10 lines
    tool_output_tail_lines=5,       # keep last 5 lines
)

Truncation runs in after_tool_execute, so the trimmed output is what enters the message history. When max_tool_output_tokens is None (the default), tool outputs are never truncated.
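As a rough picture of head-and-tail truncation, consider the standalone sketch below. It assumes the 4-characters-per-token estimate mentioned above. The function name `truncate_tool_output`, the marker text, and the behavior for outputs with too few lines are illustrative choices, not the library's actual output format.

```python
def truncate_tool_output(
    text: str,
    max_tokens: int,
    head_lines: int = 10,
    tail_lines: int = 5,
    chars_per_token: int = 4,
) -> str:
    """Keep the first and last lines of an oversized output."""
    if len(text) <= max_tokens * chars_per_token:
        return text  # within budget: leave untouched
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text  # too few lines to slice; this sketch gives up here
    omitted = len(lines) - head_lines - tail_lines
    head = lines[:head_lines]
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    marker = f"... [{omitted} lines truncated] ..."  # hypothetical marker text
    return "\n".join(head + [marker] + tail)


# A 1000-line output, far over the ~8000-character budget for 2_000 tokens.
big = "\n".join(f"row {i}: " + "x" * 50 for i in range(1000))
short = truncate_tool_output(big, max_tokens=2_000)
print(short)
```

Keeping both ends is a common choice because tool output often carries headers at the top and the final status or error at the bottom.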

Agent-Triggered Compaction¶

Set include_compact_tool=True to register a compact_conversation tool. The agent can then decide to compress the conversation itself, optionally focusing the summary on a topic:

Python

cap = ContextManagerCapability(
    max_tokens=100_000,
    include_compact_tool=True,
)

agent = Agent("openai:gpt-4.1", capabilities=[cap])

Calling the tool does not compress immediately — it sets a flag, and compaction is applied before the next model request. The optional focus argument is passed through to the summary prompt so the model prioritizes the requested topic.
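The deferred behavior follows a simple flag pattern, sketched below with a toy class. `DeferredCompactor` and `before_model_request` are hypothetical stand-ins rather than library names, and the summary is a placeholder string instead of an LLM call.

```python
from typing import List, Optional


class DeferredCompactor:
    """Toy stand-in, not the library class: records a request, applies it later."""

    def __init__(self) -> None:
        self._pending = False
        self._focus: Optional[str] = None

    def request_compact(self, focus: Optional[str] = None) -> str:
        # Called from the tool: only sets a flag, the history is untouched.
        self._pending = True
        self._focus = focus
        return "Compaction scheduled for the next model request."

    def before_model_request(self, messages: List[str]) -> List[str]:
        # Called right before each request: apply the pending compaction once.
        if not self._pending:
            return messages
        self._pending = False
        focus = f" (focus: {self._focus})" if self._focus else ""
        self._focus = None
        return [f"<summary of {len(messages)} messages{focus}>"]


compactor = DeferredCompactor()
history = ["user: hi", "assistant: hello", "user: plan the migration"]

compactor.request_compact(focus="the database migration plan")
unchanged = list(history)  # the tool call itself changed nothing
compacted = compactor.before_model_request(history)
print(compacted)
```

Deferring the work this way keeps the tool call cheap and means the history is only rewritten at a single, predictable point in the request cycle.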

Manual Compaction¶

You can also drive compaction yourself, either deferred or immediate:

Python

# Deferred: applied before the next model request (same path as the tool)
cap.request_compact(focus="the database migration plan")


async def compact_now(messages: list) -> list:
    # Immediate: compress a message list directly, outside agent.run().
    # compact() is a coroutine, so it must be awaited inside async code.
    return await cap.compact(messages, focus="the database migration plan")

request_compact() queues compaction for the next request, while compact() runs the summarization immediately and returns the compressed list. Both accept an optional focus.

Compression Callbacks¶

Two callbacks let you observe or augment compression:

Python

def before_compress(messages: list, cutoff_index: int) -> None:
    # cutoff_index is the index the processor actually chose for the cut.
    print(f"About to compress {cutoff_index}/{len(messages)} messages")


def after_compress(messages: list, summarized: bool, summary: str | None) -> str | None:
    # summarized=False covers the cases where the summarization LLM failed,
    # so a failed attempt can be told apart from a successful compression.
    if not summarized:
        print("Compression attempted but no summary was produced")
        return None
    # Returning a string re-injects it into the first request as a SystemPromptPart.
    return "Reminder: keep responses concise after compaction."


cap = ContextManagerCapability(
    max_tokens=100_000,
    on_before_compress=before_compress,
    on_after_compress=after_compress,
)

on_before_compress fires between the processor's plan and execute steps and receives the actual cutoff index, so it runs before the summarization LLM is called. on_after_compress receives a summarized flag and the generated summary text (or None). Return a string to have it appended to the first request as a SystemPromptPart; re-injection happens only when summarized=True.
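The ordering of the two callbacks can be traced with a standalone simulation. It is a sketch under stated assumptions, not the processor's code: messages are plain strings, `run_compression` is a hypothetical name, re-injection is represented by a "system:" entry, and the sketch assumes on_after receives the rebuilt list and that a failed summary leaves the history as it was.

```python
from typing import Callable, List, Optional

BeforeCb = Callable[[List[str], int], None]
AfterCb = Callable[[List[str], bool, Optional[str]], Optional[str]]


def run_compression(
    messages: List[str],
    cutoff_index: int,
    summarize: Callable[[List[str]], Optional[str]],
    on_before: BeforeCb,
    on_after: AfterCb,
) -> List[str]:
    """Plan -> on_before -> summarize -> on_after, in that order."""
    on_before(messages, cutoff_index)             # fires before the summary call
    summary = summarize(messages[:cutoff_index])  # None models a failed LLM call
    summarized = summary is not None
    if summarized:
        rebuilt = [f"summary: {summary}"] + messages[cutoff_index:]
    else:
        rebuilt = list(messages)                  # assumption: history left as-is
    extra = on_after(rebuilt, summarized, summary)
    if summarized and extra is not None:
        rebuilt = [f"system: {extra}"] + rebuilt  # re-injection only on success
    return rebuilt


events: List[str] = []
result = run_compression(
    ["m0", "m1", "m2", "m3"],
    cutoff_index=3,
    summarize=lambda msgs: events.append("summarize") or f"{len(msgs)} messages",
    on_before=lambda msgs, cut: events.append(f"before:{cut}"),
    on_after=lambda msgs, ok, s: events.append(f"after:{ok}") or "keep it short",
)
print(events)
print(result)
```

Recording events in a list like this is also a convenient way to assert callback order in your own tests.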

Next Steps¶

Capabilities - full parameter reference
Basic Usage - minimal setup
Advanced - custom token counters and prompts