Skip to content

OCR Parsing with PydanticAI

These examples demonstrate document processing capabilities using PydanticAI for OCR (Optical Character Recognition). From basic text extraction to structured output with schema validation.

Setup

macOS Users

Install the poppler dependency required by pdf2image:

brew install poppler

For other platforms, see Troubleshooting below.

Examples

1. Basic OCR Demo

File: 1_basic_ocr_demo.py

Demonstrates a basic flow for OCR on various document types. Output is Markdown-formatted content — LLMs excel at Markdown formatting tasks.

Sample Output
# Invoice

YesLogic Pty. Ltd.
7 / 39 Bouverie St
Carlton VIC 3053
Australia

**Invoice date:** Nov 26, 2016
**Invoice number:** 161126
**Payment due:** 30 days after invoice date

| Description               | From         | Until        | Amount      |
|---------------------------|-------------|-------------|-------------|
| Prince Upgrades & Support | Nov 26, 2016 | Nov 26, 2017 | USD $950.00 |
| **Total**                 |             |             | USD $950.00 |

2. OCR with Structured Output

File: 2_ocr_with_structured_output.py

Uses Pydantic BaseModel schemas provided to the LLM before inference starts. Combined with a customized prompt, the results are high quality with built-in type verification.

Output structure:

{
    "filename": "file_name_page_1.jpg",
    "analysis_result": {
        "file_type": "invoice",
        "file_content_md": "# Sunny Farm ...",
        "file_elements": [
            {
                "element_type": "table",
                "element_content": "| Item | Price | ... |"
            }
        ]
    }
}

Why Structured Output?

The upside of this approach is built-in verification of returned data types — ensuring you get the structure you want on every inference.

3. OCR Validation

File: 3_ocr_validation.py

Demonstrates purposeful ValidationError handling when LLM output doesn't match the expected schema. Uses a simplified Pydantic model to highlight validation behavior.

| ERROR | demonstrate_validation_error:35 - --- VALIDATION ERROR DETECTED ---
| INFO  | demonstrate_validation_error:44 - Field: 'file_elements'
| INFO  | demonstrate_validation_error:45 - Error Type: model_type
| INFO  | demonstrate_validation_error:46 - Reason: Input should be a valid dictionary
                                           or instance of FileElement
| INFO  | demonstrate_validation_error:47 - What the LLM actually sent: "No elements found"

Running

cd ocr_parsing

# Basic OCR
uv run 1_basic_ocr_demo.py

# Structured output
uv run 2_ocr_with_structured_output.py

# Validation errors (uncomment validation line in code first)
uv run 3_ocr_validation.py

Key Concepts

  • PDF to image conversion — Each PDF page is converted to .jpg for optimal LLM input
  • Structured schemas — Pydantic models enforce output structure and type safety
  • Parallel async processing — Semaphore-based concurrency control for multiple documents
  • Validation errors — Graceful handling when LLM output doesn't match the schema

Troubleshooting

poppler not found

brew install poppler
sudo apt-get install poppler-utils

choco install poppler
Or download from Poppler releases and add to PATH.

Rate limiting or timeout errors

Concurrency is limited to 5 parallel requests via semaphore. If you still hit rate limits, reduce the value in shared_fns.py:

semaphore = asyncio.Semaphore(3)  # Reduce from 5 to 3

File Samples

All sample files were downloaded from Prince XML.