Document Parsing (LiteParse)

LiteparseToolset gives the agent the ability to extract text and generate page screenshots from PDFs, Word documents, spreadsheets, and other document formats — locally, with no cloud services required.

Parsing is powered by LiteParse, a Node.js library with optional OCR. The Python package wraps its CLI via subprocess.

Requirements

Node.js >= 18 installed on the system
LiteParse CLI: npm install -g @llamaindex/liteparse
Python extra: pip install "pydantic-deep[liteparse]" (quoted so that shells such as zsh do not treat the brackets as a glob pattern)

The CLI is auto-installed via npm on first use if npm is in PATH (controlled by install_if_not_available, default True). For production/Docker deployments, pre-install the CLI in your Dockerfile instead.
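Before relying on auto-install in a new environment, it can help to check the prerequisites up front. The following is a minimal preflight sketch, assuming `node --version` prints a string such as `v20.11.0`. The helper names `parse_node_major` and `preflight` are illustrative and are not part of pydantic-deep.

```python
import shutil
import subprocess


def parse_node_major(version_output: str) -> int:
    """Return the major version from `node --version` output such as 'v20.11.0'."""
    return int(version_output.strip().lstrip("v").split(".")[0])


def preflight(min_major: int = 18) -> list[str]:
    """Collect human-readable problems; an empty list means the basics are in place."""
    problems = []
    node = shutil.which("node")
    if node is None:
        problems.append("node not found in PATH")
    else:
        out = subprocess.run([node, "--version"], capture_output=True, text=True).stdout
        try:
            if parse_node_major(out) < min_major:
                problems.append(f"Node.js {out.strip()} is older than {min_major}")
        except ValueError:
            problems.append(f"could not parse Node.js version from {out!r}")
    if shutil.which("npm") is None:
        # Without npm the toolset cannot auto-install the CLI on first use
        problems.append("npm not found in PATH (auto-install will not work)")
    return problems
```

Calling `preflight()` at application startup and logging any returned messages gives an early signal before the agent first tries to parse a document.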

Docker

Docker

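FROM python:3.12-slim

# Install Node.js 20 from NodeSource (LiteParse needs Node.js >= 18)
RUN apt-get update && apt-get install -y curl && \
    curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
    apt-get install -y nodejs && \
    rm -rf /var/lib/apt/lists/*

# Pre-install the LiteParse CLI so it is not downloaded at runtime
RUN npm install -g @llamaindex/liteparse

# Copy the application into a dedicated directory instead of the image root
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir "pydantic-deep[liteparse]"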

Quick Start

Enable the toolset by passing include_liteparse=True to create_deep_agent:

Python

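import asyncio

from pydantic_ai_backends import LocalBackend
from pydantic_deep import create_deep_agent, create_default_deps

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    include_liteparse=True,
)


async def main() -> None:
    # root_dir="." points the backend at the current working directory
    deps = create_default_deps(LocalBackend(root_dir="."))
    result = await agent.run("Parse report.pdf and summarize the key findings", deps=deps)
    print(result.output)


# agent.run is a coroutine, so it needs an event loop when run as a script
asyncio.run(main())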

Standalone Usage

Use LiteparseToolset directly for fine-grained control:

Python

from pydantic_ai import Agent
from pydantic_deep import DeepAgentDeps
from pydantic_deep.toolsets.liteparse import LiteparseToolset

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    deps_type=DeepAgentDeps,
    toolsets=[
        LiteparseToolset(
            ocr_enabled=True,
            ocr_language="en",
            dpi=300,
        )
    ],
)

Available Tools

| Tool | Description |
| --- | --- |
| `parse_document` | Extract full text from a document (PDF, DOCX, XLSX, images, …) |
| `screenshot_document` | Generate per-page PNG screenshots saved to the backend |

Configuration Options

| Parameter | Default | Description |
| --- | --- | --- |
| `ocr_enabled` | `True` | Enable OCR for scanned/image-based documents |
| `ocr_language` | `"en"` | OCR language code (`"en"`, `"fr"`, `"de"`, …) |
| `ocr_server_url` | `None` | HTTP OCR server URL. Uses built-in Tesseract when not set |
| `dpi` | `150` | Rendering DPI. Higher values are better for OCR but slower |
| `max_pages` | `10000` | Maximum pages to parse per document |
| `install_if_not_available` | `True` | Auto-install CLI via npm on first use |
| `descriptions` | `None` | Dict to override tool descriptions (`parse_document`, `screenshot_document`) |
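As a sketch of how these options combine, the helper below builds keyword arguments for two common cases: scanned documents (OCR on, higher DPI) and digital-native documents (OCR off, default DPI). The `liteparse_kwargs` name and the choice of 300 DPI for scans are assumptions made for illustration. Only the parameter names and defaults come from the table above.

```python
from typing import Any, Optional


def liteparse_kwargs(
    scanned: bool,
    language: str = "en",
    max_pages: int = 10000,
    ocr_server_url: Optional[str] = None,
) -> dict[str, Any]:
    """Build keyword arguments for LiteparseToolset (hypothetical helper)."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    kwargs: dict[str, Any] = {
        "ocr_enabled": scanned,          # skip OCR when a text layer is already present
        "ocr_language": language,
        "dpi": 300 if scanned else 150,  # assumption: 300 for scans, documented default otherwise
        "max_pages": max_pages,
    }
    if ocr_server_url is not None:
        # Only set the URL when an external OCR server should be used
        kwargs["ocr_server_url"] = ocr_server_url
    return kwargs
```

The result can be unpacked into the constructor, for example `LiteparseToolset(**liteparse_kwargs(scanned=True))`.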

Supported Formats

PDF — native text extraction + OCR for scanned pages
Microsoft Office — DOCX, XLSX, PPTX (requires LibreOffice on the system)
OpenDocument — ODT, ODS, ODP (requires LibreOffice)
Images — PNG, JPG, TIFF and more (requires ImageMagick)
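Since Office, OpenDocument, and image inputs depend on external programs, a quick check by file extension can surface a missing dependency before the agent runs into it. This is a minimal sketch: the extension lists are partial, and the executable names (`soffice`/`libreoffice` for LibreOffice, `magick`/`convert` for ImageMagick) are the usual ones on most systems, not names LiteParse is confirmed to look up.

```python
import shutil
from pathlib import Path
from typing import Optional

# Extension -> (dependency name, candidate executables). Partial and illustrative.
_LIBREOFFICE = ("LibreOffice", ("soffice", "libreoffice"))
_IMAGEMAGICK = ("ImageMagick", ("magick", "convert"))
_DEPENDENCIES = {
    **{ext: _LIBREOFFICE for ext in (".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp")},
    **{ext: _IMAGEMAGICK for ext in (".png", ".jpg", ".jpeg", ".tiff")},
}


def required_dependency(path: str) -> Optional[str]:
    """Name of the external program needed for this file type, or None (e.g. PDF)."""
    entry = _DEPENDENCIES.get(Path(path).suffix.lower())
    return entry[0] if entry else None


def missing_dependency(path: str) -> Optional[str]:
    """Return the dependency name if none of its executables is found in PATH."""
    entry = _DEPENDENCIES.get(Path(path).suffix.lower())
    if entry is None:
        return None
    name, executables = entry
    return None if any(shutil.which(exe) for exe in executables) else name
```

For example, `missing_dependency("slides.pptx")` returns `"LibreOffice"` on a machine where neither candidate executable is installed, and `None` otherwise.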

Custom OCR Server

Set ocr_server_url to point at a PaddleOCR or EasyOCR server for higher accuracy:

Python

LiteparseToolset(
    ocr_server_url="http://localhost:8828/ocr",
    ocr_language="en",
    dpi=200,
)

See the OCR server examples in the LiteParse repository for EasyOCR and PaddleOCR server implementations.
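When an OCR server is configured, it can be useful to fail fast if nothing is listening at that address. The sketch below only tests TCP reachability and makes no assumption about the server's request or response format. The helper names are illustrative.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(url: str) -> tuple[str, int]:
    """Extract host and port from an HTTP(S) URL, applying the scheme's default port."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an HTTP(S) URL: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def ocr_server_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the server can be opened."""
    try:
        with socket.create_connection(host_and_port(url), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False
```

A caller might run `ocr_server_reachable("http://localhost:8828/ocr")` before constructing the toolset and fall back to the built-in OCR by leaving `ocr_server_url` unset when it returns `False`.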

Notes

parse_document passes the file contents as bytes to the CLI via stdin, so no temporary files are written for PDFs.
screenshot_document writes the file to a temporary directory, calls the CLI, then copies the generated images to the backend.
Large documents may be slow on first call due to Node.js + PDF engine cold start. Subsequent calls within the same process reuse the parser instance.
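The two data paths described above can be illustrated with a small stand-in. In this sketch the child process is the Python interpreter itself, not the LiteParse CLI, and the function names are hypothetical. It only demonstrates the general patterns of piping bytes over stdin versus staging a file in a temporary directory.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in child programs; a real call would invoke the LiteParse CLI instead.
_COUNT_STDIN = "import sys; print(len(sys.stdin.buffer.read()))"
_COUNT_FILE = "import sys, pathlib; print(len(pathlib.Path(sys.argv[1]).read_bytes()))"


def via_stdin(data: bytes) -> str:
    """Pipe the document bytes to the child over stdin; nothing is written to disk."""
    proc = subprocess.run(
        [sys.executable, "-c", _COUNT_STDIN],
        input=data, capture_output=True, check=True,
    )
    return proc.stdout.decode().strip()


def via_temp_file(data: bytes, suffix: str = ".pdf") -> str:
    """Stage the bytes in a temporary directory and pass the path to the child."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"input{suffix}"
        path.write_bytes(data)
        proc = subprocess.run(
            [sys.executable, "-c", _COUNT_FILE, str(path)],
            capture_output=True, check=True,
        )
    # The temporary directory and its contents are removed on exiting the with-block
    return proc.stdout.decode().strip()
```

The stdin variant avoids disk I/O entirely, while the temporary-directory variant suits tools that need a real path or that write output files next to the input.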