EMR Bench

Evaluating computer use agents on 52 clinical EMR workflows across 10 categories.

Chris Settles & Molly Olabi · 8 min read

Electronic Medical Records (EMRs) are digital systems that store and manage patient health information—from clinical notes and prescriptions to lab results and appointment histories. Clinical documentation, prescription management, and appointment scheduling in these systems consume hours of physician time that could be spent on patient care. AI agents capable of reliably handling these workflows could reclaim this time for clinicians.

We evaluated Claude Sonnet 4.5 and Gemini 2.5 Computer Use on 52 clinical tasks in OpenEMR to determine which model can actually reduce clinician burden. The results show a clear performance difference.

01

Study Design

We evaluated performance on OpenEMR, an open-source electronic health record system used by healthcare facilities worldwide. OpenEMR provides comprehensive clinical workflows including patient management, documentation, prescriptions, scheduling, and reporting.

52
Workflows
Clinical tasks across 10 clinical domains.
10
Replicas
Per task per model configuration.
1,200+
Encounters
In a synthetic dataset of 67 patients.
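With 10 replicas per task, per-replica outcomes have to be aggregated into headline numbers. The sketch below shows one plausible scoring rule; the benchmark's exact aggregation isn't specified in this post, so the any-replica task criterion and the `runs` schema are assumptions:

```python
from collections import defaultdict

def success_rates(runs):
    """Aggregate per-replica outcomes into headline success rates.

    `runs` is a list of (task_id, passed) tuples, one per replica run
    (hypothetical schema). Returns (run_level, task_level), where a task
    counts as solved if any replica passed -- an assumed scoring rule.
    """
    by_task = defaultdict(list)
    for task_id, passed in runs:
        by_task[task_id].append(passed)
    run_level = sum(p for results in by_task.values() for p in results) / len(runs)
    task_level = sum(any(results) for results in by_task.values()) / len(by_task)
    return run_level, task_level
```

Under this rule, two tasks with two replicas each, where only one run of the first task passes, give a 25% run-level rate but a 50% task-level rate, which is one way run-level and task-level headline numbers can diverge.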
02

Results

Claude demonstrated significantly higher task completion rates across the benchmark. Averaged over all runs, Claude succeeded 67% of the time versus 23% for Gemini; at the task level, Claude solved 38 of 52 tasks to Gemini's 19.

FIGURE 01
Performance across the 52 tasks: Claude Sonnet 4.5 at 67% (38/52 tasks), Gemini 2.5 CU at 23% (19/52 tasks).
03

Finding 1: Clinical Actions Are Harder Than Retrieval

Some clinical tasks are retrieval: looking up a patient's allergies, checking recent vitals, reviewing medication lists. Others are actions: prescribing new medications, scheduling appointments, documenting procedures. We found a striking difference in how the two models handle these task types.

BREAKDOWN
Clinical Retrieval vs Clinical Action
Claude achieves 80% success on clinical retrieval and 64% on clinical actions. Gemini trails significantly, at 47% and 23% respectively.
| Task Type | Claude | Gemini | Gap | Ratio |
|---|---|---|---|---|
| Clinical Retrieval (N=28) | 80% (22/28) | 47% (13/28) | 33% | 1.70:1 |
| Clinical Action (N=24) | 64% (15/24) | 23% (6/24) | 41% | 2.78:1 |
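The Gap and Ratio columns are straightforward arithmetic on the two success rates; a minimal sketch reproduces the table's values:

```python
def gap_and_ratio(rate_a_pct, rate_b_pct):
    """Absolute gap in percentage points and the relative success ratio,
    as in the Gap and Ratio columns above."""
    return rate_a_pct - rate_b_pct, round(rate_a_pct / rate_b_pct, 2)

print(gap_and_ratio(80, 47))  # Clinical Retrieval -> (33, 1.7)
print(gap_and_ratio(64, 23))  # Clinical Action    -> (41, 2.78)
```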

Claude can handle both clinical retrieval and clinical action workflows. Gemini struggles significantly when tasks require navigating forms and entering data: the very activities that consume most clinician time.

04

Finding 2: Claude Explores More Thoroughly

Beyond success rates, we examined how each model approaches tasks. Claude takes more steps on average, exploring options methodically before acting. Gemini moves faster but with less thoroughness. This difference in problem-solving style correlates with their success rates.

STEPS DISTRIBUTION
Statistical Comparison
Box-and-whisker plot of steps per task: Claude's median is 48 steps versus Gemini's 38.

| Model | Mean ± SD | Median (IQR) | Range |
|---|---|---|---|
| Claude | 51 ± 22 | 48 (35-65) | 10-95 |
| Gemini | 42 ± 19 | 38 (28-54) | 5-100 |
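The table's columns can be computed from raw per-run step counts with Python's standard library; the step counts below are illustrative only, since the per-run data isn't published here:

```python
import statistics

def summarize_steps(steps):
    """Mean ± SD, median with IQR, and range -- the table's columns."""
    q1, median, q3 = statistics.quantiles(steps, n=4)  # quartiles
    return {
        "mean": round(statistics.mean(steps), 1),
        "sd": round(statistics.stdev(steps), 1),  # sample SD
        "median": median,
        "iqr": (q1, q3),
        "range": (min(steps), max(steps)),
    }

summary = summarize_steps([10, 20, 30, 40, 50])  # illustrative step counts
```

Note that `statistics.quantiles` defaults to the exclusive method; inclusive quartiles would shift the IQR slightly.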
05

Finding 3: Consistency Across All Clinical Domains

We tested workflows across 10 clinical domains: patient operations, documentation, medications, appointments, vital signs, immunizations, and more. The question: would Gemini excel in some areas even if Claude was better overall? The answer: no. Claude outperformed Gemini in every single category.

FIGURE 02
Performance Across Task Categories
Claude outperforms Gemini in all 10 clinical categories, with gaps ranging from 13 to 80 percentage points.
| Category | Claude | Gemini |
|---|---|---|
| Basic Patient Operations | 100% | 38% |
| Clinical Documentation | 67% | 45% |
| Advanced Workflows | 80% | 0% |
| Advanced Patient Search | 78% | 30% |
| Medication Management | 78% | 18% |
| Condition Management | 100% | 50% |
| Vital Signs | 56% | 10% |
| Encounters | 60% | 10% |
| Immunizations | 33% | 20% |
| Appointments | 56% | 9% |

Statistics: 10 categories tested. Gap range: 13-80 percentage points. Mean gap: 48 ± 19 points. Claude superior in 10/10 categories.
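As a sanity check, the gap statistics follow directly from the per-category table (sample standard deviation assumed):

```python
import statistics

# (Claude %, Gemini %) per category, from the table above.
results = {
    "Basic Patient Operations": (100, 38),
    "Clinical Documentation": (67, 45),
    "Advanced Workflows": (80, 0),
    "Advanced Patient Search": (78, 30),
    "Medication Management": (78, 18),
    "Condition Management": (100, 50),
    "Vital Signs": (56, 10),
    "Encounters": (60, 10),
    "Immunizations": (33, 20),
    "Appointments": (56, 9),
}

gaps = [claude - gemini for claude, gemini in results.values()]
mean_gap = statistics.mean(gaps)        # 47.8
sd_gap = statistics.stdev(gaps)         # ~19.1 (sample SD)
claude_wins = sum(c > g for c, g in results.values())  # 10
gap_range = (min(gaps), max(gaps))      # (13, 80)
```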

Observations: Both models show variable performance across categories. Claude ranges from 33% (Immunizations) to 100% (Basic Operations, Condition Management). Gemini ranges from 0% (Advanced Workflows) to 50% (Condition Management). Neither model achieves comprehensive coverage across all clinical domains.

06

Case Study: Tracking Weight Loss Over Time

Task: Calculate a patient's weight change over the last 6 months

Expected: 3.5 kg loss, decreasing trend

Claude: 50% success (1/2 attempts)

Gemini: 0% success (0/2 attempts)

This task requires navigating to vitals history, filtering by date range (last 6 months), finding weight measurements across multiple visits, calculating the change, and determining the trend. It combines temporal filtering, multi-record data extraction, and numerical reasoning.
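The steps the task requires can be sketched in a few lines. The vitals schema, dates, and weights below are hypothetical, chosen to mirror the expected answer:

```python
from datetime import date, timedelta

def weight_change(vitals, today, months=6):
    """Weight delta (kg) and trend over the last `months` months.

    `vitals` is a list of (visit_date, weight_kg) tuples -- a hypothetical
    schema standing in for OpenEMR's vitals history.
    """
    cutoff = today - timedelta(days=30 * months)  # approximate month length
    window = sorted((d, w) for d, w in vitals if d >= cutoff)
    if len(window) < 2:
        raise ValueError("need at least two measurements in the window")
    delta = window[-1][1] - window[0][1]
    trend = "decreasing" if delta < 0 else "increasing" if delta > 0 else "stable"
    return delta, trend

history = [
    (date(2024, 6, 1), 90.0),   # outside the 6-month window
    (date(2025, 1, 5), 82.0),
    (date(2025, 4, 10), 80.0),
    (date(2025, 6, 20), 78.5),
]
weight_change(history, today=date(2025, 7, 1))  # (-3.5, 'decreasing')
```

Retrieving only the most recent weight, as Gemini does, cannot produce the delta: the task inherently requires at least two measurements inside the window.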

Claude's approach: When successful, it accesses the full vitals history, reviews entries within the timeframe, and calculates the delta correctly. When it fails, it only finds the most recent weight.

Gemini's limitation: Retrieves only a single recent weight measurement. Unable to filter by date range or recognize that "6 months" requires comparing multiple historical points.

A capability gap: Temporal reasoning over historical data (comparing values across time rather than retrieving current values) falls outside the set of tasks both models handle reliably, and shows lower success rates overall. Even Claude succeeds only half the time on this task.

07

Summary

| Metric | Claude Sonnet 4.5 | Gemini 2.5 CU |
|---|---|---|
| Overall success rate | 73% (38/52) | 37% (19/52) |
| Clinical retrieval | 80% | 47% |
| Clinical action | 64% | 23% |
| Performance ratio | 1.97:1 | baseline |

Both models show substantial room for improvement on clinical workflows. Claude's 73% task-level success rate still leaves 14 of 52 tasks unsolved; Gemini's 37% leaves 33 of 52 beyond current capabilities. The gap between current performance and comprehensive task coverage represents the development frontier for computer-use AI in healthcare applications.