PDF forms are everywhere: tax filings, medical intake, NDAs, insurance enrollment. They are brittle, visually complex, and demand near-perfect accuracy. That makes them one of the hardest still-unsolved automation targets. So we asked: can today's frontier AI models reliably fill out a PDF?
The Benchmark
PDFBench consists of 70 real-world PDF form-filling tasks spanning 9 common form types. Each task gives an AI agent a blank PDF form, written instructions describing what data to fill in, programmatic access via PyMuPDF, and optionally a screenshot tool to visually inspect the form. The goal is simple: fill out the form correctly.
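To make the setup concrete, here is roughly what the programmatic path looks like with PyMuPDF. This is a minimal sketch; the file name, field name, and value are illustrative, not taken from a benchmark task:

```python
import fitz  # PyMuPDF

doc = fitz.open("enrollment_form.pdf")  # illustrative path, not a benchmark file

for page in doc:
    for widget in page.widgets():           # iterate the interactive form fields
        if widget.field_name == "Last Name":
            widget.field_value = "Rivera"    # set the value ...
            widget.update()                  # ... and write it back into the widget

doc.save("enrollment_form_filled.pdf")
```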
Results Overview
We tested three frontier models: Gemini Pro 3 (Google), Claude Opus 4.5 (Anthropic), and GPT-5.2 (OpenAI). Each model ran with and without visual (screenshot) access to the PDF.
The Simplest Task
We started by asking: what's the simplest possible task for an LLM to complete?
Each task includes field names, their locations on the page, and their values to write. This provides a complete mapping from instruction to action.
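For illustration, an explicit mapping in this setting might look like the following. This is a hypothetical spec; the exact schema PDFBench uses may differ:

```python
# Hypothetical task spec for the "simplest task" setting: explicit field names,
# page locations (PDF points), and the values to write.
TASK = [
    {"field": "Text1",       "page": 0, "rect": [72, 140, 290, 158],  "value": "Jordan Alvarez"},
    {"field": "undefined_2", "page": 0, "rect": [300, 140, 540, 158], "value": "1987-03-14"},
    {"field": "Check Box10", "page": 1, "rect": [72, 410, 86, 424],   "value": True},
]
```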

With screenshots enabled, Claude Opus achieved a perfect 100% pass rate. Even without visual access, it still hit 92.2%.
Takeaway: With explicit field mappings, frontier models are approaching real-world viability. Not quite production-ready for high-stakes workflows, but close.
Adding Real-World Flavor
We noticed something: many PDF field names are useless. You've probably seen fields like undefined_2 and Check Box10 throughout this post. Below is what a typical form looks like internally. Opaque names everywhere.
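You can see this for yourself by dumping a form's fields with PyMuPDF. The file name is illustrative, and the sample output is typical of what such forms contain rather than a specific benchmark form:

```python
import fitz  # PyMuPDF

doc = fitz.open("intake_form.pdf")  # illustrative path
for page in doc:
    for widget in page.widgets():
        # field_type_string is e.g. "Text", "CheckBox", "RadioButton"
        print(page.number, widget.field_type_string, repr(widget.field_name))

# Typical output: opaque, auto-generated names such as
#   0 Text 'Text1'
#   0 Text 'undefined_2'
#   1 CheckBox 'Check Box10'
```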

So we tried something different. What if we described fields the way a person would: by what they mean, not what they're called in the code?
We created a new set of tasks to test this.

The average pass rate dropped 52 points. Claude Opus, the same model that achieved 100% with explicit field names, reached only 56.7% here.
Key Insights
The benchmark surfaced two recurring failure patterns.
The Balancing Act
PDFs often contain conflicting signals: metadata says one thing, visual layout says another. Models struggle to reconcile them, and that's where failures cluster.
The Scenario: The EyeMed form. The PDF's internal metadata labels the phone-number parts undefined_5, while the screenshot clearly shows a box for the digits.
When we told the model exactly which field ID to fill (Fieldname task), it scored 100%. When we gave it semantic instructions ("Fill the phone number"), accuracy dropped to 71.7%, a 28.3-point gap.
Of that gap, 99.6% of the errors were "empty" errors: the model simply left the fields blank. Its traces show it could often (though not always) identify the fields in the screenshot and plan to fill them, yet the confusing metadata labels seemed to win out; the model never explicitly reconciled the two signals.
Visual Reasoning
For simpler forms like the rental application, screenshots helped models jump from 70.5% to 90.6% field accuracy. But the prescription form told a different story.
The Stat: Even with screenshots, the strict pass rate was a dismal 3.1%.
The Logic Check: Unlike the rental form, partial credit barely helped here. Field correctness nudged up by only 4.6 points (from roughly 88% to 92%).
The "Why": The Prescription form relies on implicit business logic.
The model can see the boxes. What it misses are the implicit constraints: that filling one section means checking a box nearby, or that choosing the left column means leaving the right one blank. These rules are encoded in layout, not labeled anywhere.
Below we dive deeper into the specific failure patterns we observed.
Failure Mode Deep Dive
Analyzing thousands of agent traces revealed three main failure patterns that persist even when models achieve high field-level accuracy.
Field Mapping: Semantic → Opaque
Users describe fields naturally - "Company Email", "Bank Account Number" - but PDF forms use programmatic names like Text Field 720 that bear no semantic relationship to their purpose.
In a representative trace, the model correctly identified all three values but placed them in the wrong fields. Spatial proximity combined with meaningless field names made the fields effectively indistinguishable.
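One mitigation is to ground semantic descriptions in the label text printed next to each widget rather than in the field name. Below is a minimal sketch of that idea with PyMuPDF; the nearest-label heuristic, distance threshold, and file name are illustrative assumptions, not part of the benchmark harness:

```python
import fitz  # PyMuPDF

def nearest_label(page, widget, max_dist=80):
    """Return the printed word closest to the left of / above a form field."""
    wx, wy = widget.rect.x0, (widget.rect.y0 + widget.rect.y1) / 2
    best, best_dist = None, max_dist
    for x0, y0, x1, y1, word, *_ in page.get_text("words"):
        # only consider words that end left of, or sit above, the field box
        if x1 <= wx + 2 or y1 <= widget.rect.y0 + 2:
            dist = ((x1 - wx) ** 2 + ((y0 + y1) / 2 - wy) ** 2) ** 0.5
            if dist < best_dist:
                best, best_dist = word, dist
    return best

doc = fitz.open("direct_deposit.pdf")  # illustrative path
page = doc[0]
for widget in page.widgets():
    # e.g.  Text Field 720  ->  'Routing'
    print(widget.field_name, "->", nearest_label(page, widget))
```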
Implicit Logic: Business Rules Encoded in Layout
This is where things get interesting. PDF forms encode business rules through visual layout (sections, groupings, conditional dependencies) rather than explicit labels. Models must infer these rules from spatial relationships and domain knowledge, and this is where they fail most catastrophically.
Consider what a human implicitly understands when filling out a prescription form: checking the "AllianceRx Walgreens" pharmacy option means you've chosen a specialty pharmacy, which requires prescriber attestation. Selecting one medication dosage means leaving the other dosage options blank. These rules are never written down. They're encoded in how the form is organized.
The agent filled 46 of 49 fields correctly. But the 3 fields it missed weren't random: they formed a logical dependency chain of exactly the kind described above.
Why this matters: The prescription form had a 3.1% strict pass rate even with screenshots enabled, and field accuracy improved only from roughly 88% to 92%, a 4.6-point gain. The model can see the checkboxes. What it can't do is infer that checking one box logically requires checking another.
This pattern appears across form types. Medical consent forms have witness signatures that only apply if the patient can't sign. Tax forms have sections that become mandatory based on earlier selections. Insurance enrollment has coverage tiers where selecting one option invalidates others. These dependencies are obvious to humans who understand the domain. They're invisible to models that process fields independently.
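Such dependencies only become checkable once someone writes them down. Here is a minimal sketch of a rule-based validator, with hypothetical rules loosely modeled on the prescription form; the rule set and field names are our illustration, not something PDFBench ships:

```python
from dataclasses import dataclass

@dataclass
class Requires:
    """If `trigger` is filled/checked, then `required` must also be filled."""
    trigger: str
    required: str

@dataclass
class MutuallyExclusive:
    """At most one of `fields` may be filled/checked."""
    fields: tuple

RULES = [
    Requires(trigger="specialty_pharmacy_alliancerx", required="prescriber_attestation"),
    MutuallyExclusive(fields=("dose_40mg", "dose_80mg")),
]

def violations(filled: dict, rules=RULES):
    """Return human-readable rule violations for a filled-form dict."""
    errors = []
    for rule in rules:
        if isinstance(rule, Requires):
            if filled.get(rule.trigger) and not filled.get(rule.required):
                errors.append(f"{rule.trigger} is set but {rule.required} is empty")
        elif isinstance(rule, MutuallyExclusive):
            set_fields = [f for f in rule.fields if filled.get(f)]
            if len(set_fields) > 1:
                errors.append(f"only one of {rule.fields} may be set, got {set_fields}")
    return errors

# Example: the agent checked the specialty pharmacy but skipped the attestation box.
print(violations({"specialty_pharmacy_alliancerx": True, "dose_40mg": "40 mg"}))
```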
Spatial Reasoning: Adjacent Field Confusion
When multiple semantically similar fields appear within 50-100 pixels of each other, models confuse their positions. This is especially problematic for financial data (routing vs. account numbers), address components, and date fields.
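A practical safeguard is to flag clusters of nearby fields for extra verification before writing anything. A rough sketch follows; the distance threshold mirrors the range above, and the helper and file name are our own illustration:

```python
import fitz  # PyMuPDF

def ambiguous_pairs(page, max_gap=100):
    """Return pairs of form fields whose centers lie within `max_gap` PDF points."""
    widgets = list(page.widgets())
    pairs = []
    for i, a in enumerate(widgets):
        for b in widgets[i + 1:]:
            dx = (a.rect.x0 + a.rect.x1) / 2 - (b.rect.x0 + b.rect.x1) / 2
            dy = (a.rect.y0 + a.rect.y1) / 2 - (b.rect.y0 + b.rect.y1) / 2
            if (dx * dx + dy * dy) ** 0.5 <= max_gap:
                pairs.append((a.field_name, b.field_name))
    return pairs

doc = fitz.open("direct_deposit.pdf")  # illustrative path
print(ambiguous_pairs(doc[0]))  # e.g. [('Text Field 719', 'Text Field 720'), ...]
```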
What This Means
PDF form filling sits at the intersection of vision, reasoning, and structured data manipulation. Current frontier models can achieve impressive field-level accuracy, but production deployment requires near-perfect reliability. The gap between 93% accuracy and 100% represents billions of dollars in manual review, error correction, and compliance risk.
The path forward likely involves hybrid approaches: using model capabilities for understanding and planning while building robust tooling for execution. Until then, PDFBench provides a clear benchmark for measuring progress on one of automation's most stubborn frontiers.
Happy to chat and share more details: book a call.


