ContextManagerMiddleware

The ContextManagerMiddleware is a dual-protocol class that provides real-time context management during agent execution. It combines:

  1. History processor (pydantic-ai): Tracks token usage and auto-compresses conversation when approaching the token budget.
  2. Agent middleware (pydantic-ai-middleware): Optionally truncates large tool outputs inline before they enter the conversation history.
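
In code, the dual-protocol shape looks roughly like the skeleton below. The method names come from this page; the signatures are simplified and purely illustrative.

Python
# Illustrative skeleton only -- the real class ships in
# pydantic_ai_summarization and its actual signatures may differ.
class ContextManagerMiddleware:
    async def __call__(self, messages: list) -> list:
        # History processor protocol: runs before each model call.
        # Counts tokens, notifies the usage callback, compresses if needed.
        ...

    async def after_tool_call(self, context, result):
        # Middleware hook: runs after each tool call and may truncate
        # a result that exceeds max_tool_output_tokens.
        ...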

Requires the hybrid extra

This middleware requires the hybrid optional dependency:

Bash
pip install "pydantic-ai-summarization[hybrid]"

See Installation for details.

How It Works

The middleware operates on two levels during each agent run:

Text Only
┌──────────────────────────────────────────────────────────────┐
│                      Agent Run Loop                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. History Processor (__call__)                             │
│     ├─ Count tokens in current messages                      │
│     ├─ Notify usage callback (percentage, current, max)      │
│     ├─ If usage >= compress_threshold:                       │
│     │   ├─ Summarize older messages via LLM                  │
│     │   ├─ Replace old messages with summary                 │
│     │   └─ Notify updated usage                              │
│     └─ Return (possibly compressed) messages                 │
│                                                              │
│  2. After Tool Call (after_tool_call)                        │
│     ├─ Check if result exceeds max_tool_output_tokens        │
│     ├─ If yes: truncate to head + tail lines                 │
│     └─ Return original or truncated result                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Token tracking: Before every model call, the middleware counts the tokens in the current conversation and compares the count against max_tokens * compress_threshold. If the count meets or exceeds that threshold, it automatically compresses older messages using LLM summarization.
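
A minimal, self-contained sketch of the trigger logic (count_tokens and the inline summarization below are illustrative stand-ins for the middleware's token counter and LLM summarization step):

Python
def count_tokens(messages: list[str]) -> int:
    # Rough 4-characters-per-token heuristic, standing in for
    # count_tokens_approximately.
    return sum(len(m) // 4 for m in messages)

def maybe_compress(messages: list[str], max_tokens: int = 200_000,
                   compress_threshold: float = 0.9) -> list[str]:
    # Mirror of the middleware's trigger: compress once usage reaches
    # max_tokens * compress_threshold, keeping the last 20 messages.
    usage = count_tokens(messages)
    if usage >= max_tokens * compress_threshold and len(messages) > 20:
        older, recent = messages[:-20], messages[-20:]
        summary = f"[summary of {len(older)} earlier messages]"  # an LLM call in reality
        return [summary, *recent]
    return messages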

Tool output truncation: When max_tool_output_tokens is set, the middleware intercepts tool results via the after_tool_call hook and truncates any output that exceeds the token limit, keeping configurable head and tail lines.
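
The head/tail strategy can be pictured with this small sketch (illustrative; the real middleware triggers on token count rather than line count):

Python
def truncate_output(text: str, head_lines: int = 5, tail_lines: int = 5) -> str:
    # Keep the first head_lines and last tail_lines lines, with an
    # omission marker in between -- the format shown later on this page.
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text
    omitted = len(lines) - head_lines - tail_lines
    return "\n".join(
        lines[:head_lines]
        + [f"\n... ({omitted} lines omitted) ...\n"]
        + lines[-tail_lines:]
    )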

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | int | 200_000 | Maximum token budget for the conversation |
| compress_threshold | float | 0.9 | Fraction of max_tokens at which auto-compression triggers; must be in (0.0, 1.0] |
| keep | ContextSize | ("messages", 20) | How much context to retain after compression |
| summarization_model | str | "openai:gpt-4.1-mini" | Model used for generating summaries |
| token_counter | TokenCounter | count_tokens_approximately | Function to count tokens in messages |
| summary_prompt | str | DEFAULT_SUMMARY_PROMPT | Prompt template for summary generation |
| trim_tokens_to_summarize | int | 4000 | Max tokens to include when generating the summary |
| max_input_tokens | int \| None | None | Model max input tokens (required for fraction-based keep) |
| max_tool_output_tokens | int \| None | None | Per-tool-output token limit before truncation; None disables truncation |
| tool_output_head_lines | int | 5 | Lines to show from the beginning of truncated tool output |
| tool_output_tail_lines | int | 5 | Lines to show from the end of truncated tool output |
| on_usage_update | UsageCallback \| None | None | Callback invoked with usage stats before each model call |
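
The keep parameter controls how much context survives a compression pass. Judging from the ("messages", 20) default and the max_input_tokens row above, ContextSize appears to be a (kind, amount) tuple; the fraction form below is inferred, not confirmed by this page:

Python
keep_last_20 = ("messages", 20)   # retain the 20 most recent messages
keep_third = ("fraction", 0.3)    # assumed form: ~30% of max_input_tokens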

UsageCallback

The on_usage_update parameter accepts a callable with the signature:

Python
UsageCallback = Callable[[float, int, int], Any]

The callback receives three arguments:

  • percentage (float): Current usage as a fraction of max_tokens (e.g., 0.85 for 85%)
  • current_tokens (int): Current token count in the conversation
  • max_tokens (int): The configured maximum token budget

Both sync and async callables are supported. If the callable returns an awaitable, it will be awaited automatically.
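
A sketch of how such sync-or-async dispatch typically works (illustrative, not the library's internals):

Python
import inspect

async def notify(callback, percentage: float, current: int, maximum: int) -> None:
    # Call the user's callback; await only if it returned an awaitable.
    result = callback(percentage, current, maximum)
    if inspect.isawaitable(result):
        await result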

Basic Usage

Python
from pydantic_ai import Agent
from pydantic_ai_middleware import MiddlewareAgent
from pydantic_ai_summarization import create_context_manager_middleware

middleware = create_context_manager_middleware(
    max_tokens=200_000,
    compress_threshold=0.9,
    keep=("messages", 20),
)

# Register as both history processor and middleware
agent = Agent(
    "openai:gpt-4.1",
    history_processors=[middleware],
)
wrapped = MiddlewareAgent(agent, middleware=[middleware])
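
Passing the same object twice is intentional: the Agent uses it as a history processor, while MiddlewareAgent drives its tool-call hook. Assuming MiddlewareAgent mirrors Agent's run API (not shown on this page), a run would look like:

Python
# Assumption: MiddlewareAgent exposes the same run() interface as Agent.
result = await wrapped.run("Analyze the attached logs")
print(result.output)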

With Usage Callback

Track token usage in real time:

Python
from pydantic_ai_summarization import create_context_manager_middleware

def on_usage(percentage: float, current: int, maximum: int) -> None:
    print(f"Token usage: {percentage:.0%} ({current:,} / {maximum:,})")

middleware = create_context_manager_middleware(
    max_tokens=200_000,
    on_usage_update=on_usage,
)

Async callbacks are also supported:

Python
# save_usage_to_db stands in for your own persistence helper.
async def on_usage_async(percentage: float, current: int, maximum: int) -> None:
    await save_usage_to_db(percentage, current, maximum)

middleware = create_context_manager_middleware(
    on_usage_update=on_usage_async,
)

With Tool Output Truncation

Prevent large tool outputs from consuming too much of the token budget:

Python
from pydantic_ai_summarization import create_context_manager_middleware

middleware = create_context_manager_middleware(
    max_tokens=200_000,
    max_tool_output_tokens=2000,      # Truncate outputs > ~2000 tokens
    tool_output_head_lines=10,        # Show first 10 lines
    tool_output_tail_lines=10,        # Show last 10 lines
)

When a tool output exceeds the limit, it is truncated to show the first and last N lines with an indicator of how many lines were omitted:

Text Only
Line 1
Line 2
...
Line 10

... (485 lines omitted) ...

Line 496
Line 497
...
Line 505

Factory Function

The create_context_manager_middleware() factory builds a configured middleware instance:

Python
from pydantic_ai_summarization import create_context_manager_middleware

# With defaults
middleware = create_context_manager_middleware()

# Fully configured
middleware = create_context_manager_middleware(
    max_tokens=150_000,
    compress_threshold=0.85,
    keep=("messages", 30),
    summarization_model="openai:gpt-4.1-mini",
    max_tool_output_tokens=1000,
    tool_output_head_lines=5,
    tool_output_tail_lines=5,
    on_usage_update=lambda pct, cur, mx: print(f"{pct:.0%}"),
)

Properties

compression_count

The compression_count property returns the number of times compression has been triggered during the lifetime of the middleware instance:

Python
middleware = create_context_manager_middleware()

# ... after some agent runs ...

print(f"Compressed {middleware.compression_count} times")

Comparison with Standalone Processors

| Feature | ContextManagerMiddleware | SummarizationProcessor | SlidingWindowProcessor |
| --- | --- | --- | --- |
| Token tracking | Built-in | No | No |
| Usage callbacks | Yes | No | No |
| Auto-compression | Yes (threshold-based) | Yes (trigger-based) | No |
| Tool output truncation | Yes | No | No |
| LLM cost | Per compression | Per trigger | Zero |
| Requires extra | [hybrid] | No | No |

Next Steps