EMR Bench

Electronic Medical Records (EMRs) are digital systems that store and manage patient health information—from clinical notes and prescriptions to lab results and appointment histories. Clinical documentation, prescription management, and appointment scheduling in these systems consume hours of physician time that could be spent on patient care. AI agents capable of reliably handling these workflows could reclaim this time for clinicians.

We evaluated Claude Sonnet 4.5 and Gemini 2.5 Computer Use on 52 clinical tasks in OpenEMR to determine which model can actually reduce clinician burden. The results show a clear performance difference.

Study Design

We evaluated performance on OpenEMR, an open-source electronic health record system used by healthcare facilities worldwide. OpenEMR provides comprehensive clinical workflows including patient management, documentation, prescriptions, scheduling, and reporting.

Workflows

Clinical tasks across various domains.

Replicas

Per task per model configuration.

1,200+

Dataset

Clinical encounters across 67 synthetic patients.

Results

Claude demonstrated significantly higher task completion rates across the benchmark. Overall success rates showing Claude at 67% and Gemini at 23%. Claude succeeded on 38 of 52 tasks, Gemini on 19 tasks.

FIGURE 01

Performance Across Fifty-Two Tasks

67%

Claude Sonnet 4.5

38/52 tasks

23%

Gemini 2.5 CU

19/52 tasks

Finding 1: Clinical Actions Are Harder Than Retrieval

Some clinical tasks are retrieval: looking up a patient's allergies, checking recent vitals, reviewing medication lists. Others are actions: prescribing new medications, scheduling appointments, documenting procedures. We found a striking difference in how the two models handle these task types.

BREAKDOWN

Clinical Retrieval vs Clinical Action

Claude achieves 70% success on clinical retrieval and 61% on clinical actions. Gemini shows significantly lower performance, with 27% and 17% respectively.

Clinical Retrieval

Claude

70%

Gemini

27%

Clinical Action

Claude

61%

Gemini

17%

Task Type	Claude	Gemini	Gap	Ratio
Clinical Retrieval N=28	80% 22/28	47% 13/28	33%	1.70:1
Clinical Action N=24	64% 15/24	23% 6/24	41%	2.78:1

Claude can handle both clinical retrieval and clinical action workflows. Gemini struggles significantly when tasks require navigating forms and entering data: the very activities that consume most clinician time.

Finding 2: Claude Explores More Thoroughly

Beyond success rates, we examined how each model approaches tasks. Claude takes more steps on average, exploring options methodically before acting. Gemini moves faster but with less thoroughness. This difference in problem-solving style correlates with their success rates.

STEPS DISTRIBUTION

Statistical Comparison

Box-and-whisker plot showing Claude's median at 50 steps vs Gemini's median at 12 steps.

Median steps

Claude

Median steps

Gemini

Model	Mean ± SD	Median (IQR)	Range
Claude	51 ± 22	48 (35-65)	10-95
Gemini	42 ± 19	38 (28-54)	5-100

Finding 3: Consistency Across All Clinical Domains

We tested workflows across 10 clinical domains: patient operations, documentation, medications, appointments, vital signs, immunizations, and more. The question: would Gemini excel in some areas even if Claude was better overall? The answer: no. Claude outperformed Gemini in every single category.

FIGURE E

Performance Across Task Category

Claude outperforms Gemini in all 10 clinical categories, with gaps ranging from 13% to 80%.

100%

38%

67%

45%

80%

78%

18%

56%

10%

Basic Patient Operations

Clinical Documentation

Advanced Workflows

Medication Management

Vital Signs

Claude

Gemini

Category	Claude	Gemini
Basic Patient Operations	100%	38%
Clinical Documentation	67%	45%
Advanced Workflows	80%	0%
Advanced Patient Search	78%	30%
Medication Management	78%	18%
Condition Management	100%	50%
Vital Signs	56%	10%
Encounters	60%	10%
Immunizations	33%	20%
Appointments	56%	9%

Statistics: 10 categories tested. Gap range: 13-80%. Mean: 51 ± 21%. Claude superior: 10/10 categories.

Observations: Both models show variable performance across categories. Claude ranges from 33% (Immunizations) to 100% (Basic Operations, Condition Management). Gemini ranges from 0% (Advanced Workflows) to 50% (Condition Management). Neither model achieves comprehensive coverage across all clinical domains.

Case Study: Tracking Weight Loss Over Time

Task: Calculate a patient's weight change over the last 6 months

Expected: 3.5 kg loss, decreasing trend

Claude: 50% success (1/2 attempts)

Gemini: 0% success (0/2 attempts)

This task requires navigating to vitals history, filtering by date range (last 6 months), finding weight measurements across multiple visits, calculating the change, and determining the trend. It combines temporal filtering, multi-record data extraction, and numerical reasoning.

Claude's approach: When successful, it accesses the full vitals history, reviews entries within the timeframe, and calculates the delta correctly. When it fails, it only finds the most recent weight.

Gemini's limitation: Retrieves only a single recent weight measurement. Unable to filter by date range or recognize that "6 months" requires comparing multiple historical points.

A capability gap: This represents one type of task not well covered in the distribution of tasks both models handle reliably. Temporal reasoning over historical data (comparing values across time rather than retrieving current values) shows lower success rates. Even Claude only succeeds half the time on this specific task type.

Summary

Metric	Claude Sonnet 4.5	Gemini 2.5 CU
Overall success rate	73% (38/52)	37% (19/52)
Clinical retrieval	80%	47%
Clinical action	64%	23%
Performance ratio	1.97:1	baseline

Both models show substantial room for improvement on clinical workflows. Claude's 67% success rate leaves 33% of tasks unsolved. Gemini's 23% success rate indicates 77% of tasks remain beyond current capabilities. The gap between current performance and comprehensive task coverage represents the development frontier for computer-use AI in healthcare applications.