Electronic Medical Records (EMRs) are digital systems that store and manage patient health information—from clinical notes and prescriptions to lab results and appointment histories. Clinical documentation, prescription management, and appointment scheduling in these systems consume hours of physician time that could be spent on patient care. AI agents capable of reliably handling these workflows could reclaim this time for clinicians.
We evaluated Claude Sonnet 4.5 and Gemini 2.5 Computer Use on 52 clinical tasks in OpenEMR to determine which model can actually reduce clinician burden. The results show a clear performance difference.
Study Design
We evaluated performance on OpenEMR, an open-source electronic health record system used by healthcare facilities worldwide. OpenEMR provides comprehensive clinical workflows including patient management, documentation, prescriptions, scheduling, and reporting.
Results
Claude demonstrated significantly higher task completion rates across the benchmark. Overall success rates were 67% for Claude and 23% for Gemini. Claude succeeded on 38 of the 52 tasks, Gemini on 19.
Finding 1: Clinical Actions Are Harder Than Retrieval
Some clinical tasks are retrieval: looking up a patient's allergies, checking recent vitals, reviewing medication lists. Others are actions: prescribing new medications, scheduling appointments, documenting procedures. We found a striking difference in how the two models handle these task types.
Claude can handle both clinical retrieval and clinical action workflows. Gemini struggles significantly when tasks require navigating forms and entering data: the very activities that consume most clinician time.
Finding 2: Claude Explores More Thoroughly
Beyond success rates, we examined how each model approaches tasks. Claude takes more steps on average, exploring options methodically before acting. Gemini moves faster but with less thoroughness. This difference in problem-solving style correlates with their success rates.
Finding 3: Consistency Across All Clinical Domains
We tested workflows across 10 clinical domains: patient operations, documentation, medications, appointments, vital signs, immunizations, and more. The question: would Gemini excel in some areas even if Claude led overall? The answer: no. Claude outperformed Gemini in every category.
Statistics: across the 10 categories tested, the success-rate gap ranged from 13 to 80 percentage points (mean 51 ± 21), with Claude ahead in 10 of 10 categories.
Observations: Both models show variable performance across categories. Claude ranges from 33% (Immunizations) to 100% (Basic Operations, Condition Management). Gemini ranges from 0% (Advanced Workflows) to 50% (Condition Management). Neither model achieves comprehensive coverage across all clinical domains.
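The per-category gap analysis above can be sketched in a few lines. This is a minimal illustration with largely hypothetical rates: only the category names, Claude's 100% on Basic Operations and Condition Management, Claude's 33% on Immunizations, Gemini's 50% on Condition Management, and Gemini's 0% on Advanced Workflows come from the text; the remaining figures are invented for the example.

```python
from statistics import mean

def category_gaps(results):
    """Per-category success-rate gap (Claude minus Gemini), in percentage points.

    results maps category name -> (claude_rate_pct, gemini_rate_pct).
    """
    return {name: claude - gemini for name, (claude, gemini) in results.items()}

# Hypothetical per-category rates for illustration (see note above); the real
# evaluation covered 10 categories, not 4.
results = {
    "Basic Operations": (100, 40),
    "Condition Management": (100, 50),
    "Immunizations": (33, 20),
    "Advanced Workflows": (80, 0),
}
gaps = category_gaps(results)
average_gap = mean(gaps.values())
claude_leads_everywhere = all(gap > 0 for gap in gaps.values())
```

With real per-category data in `results`, the same two summary values (mean gap and whether one model leads in every category) fall out directly.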
Case Study: Tracking Weight Loss Over Time
Task: Calculate a patient's weight change over the last 6 months
Expected: 3.5 kg loss, decreasing trend
Claude: 50% success (1/2 attempts)
Gemini: 0% success (0/2 attempts)
This task requires navigating to vitals history, filtering by date range (last 6 months), finding weight measurements across multiple visits, calculating the change, and determining the trend. It combines temporal filtering, multi-record data extraction, and numerical reasoning.
Claude's approach: When successful, it accesses the full vitals history, reviews entries within the timeframe, and calculates the delta correctly. When it fails, it only finds the most recent weight.
Gemini's limitation: Retrieves only a single recent weight measurement. Unable to filter by date range or recognize that "6 months" requires comparing multiple historical points.
A capability gap: temporal reasoning over historical data (comparing values across time rather than retrieving current values) sits outside the set of tasks either model handles reliably. Even Claude succeeds only half the time on this specific task type.
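The computation this task requires, once the vitals history is in hand, is short. Below is a minimal standalone sketch, not OpenEMR's actual data model or API; the vitals list is hypothetical, constructed so the answer matches the expected result (a 3.5 kg loss with a decreasing trend).

```python
from datetime import datetime, timedelta

def weight_change(vitals, as_of, window_days=182):
    """Compute weight delta and trend over a lookback window.

    vitals: list of (date, weight_kg) tuples, in any order.
    Returns (delta_kg, trend), where delta = latest - earliest in the window.
    """
    cutoff = as_of - timedelta(days=window_days)
    in_window = sorted((d, w) for d, w in vitals if cutoff <= d <= as_of)
    if len(in_window) < 2:
        raise ValueError("need at least two measurements in the window")
    first, last = in_window[0][1], in_window[-1][1]
    delta = last - first
    trend = "decreasing" if delta < 0 else "increasing" if delta > 0 else "stable"
    return delta, trend

# Hypothetical measurements; the January reading falls outside the 6-month
# window and must be excluded, which is the step Gemini consistently missed.
vitals = [
    (datetime(2024, 1, 10), 82.0),
    (datetime(2024, 4, 2), 80.5),
    (datetime(2024, 6, 15), 79.1),
    (datetime(2024, 9, 20), 77.0),
]
delta, trend = weight_change(vitals, as_of=datetime(2024, 10, 1))
```

The sketch makes the failure mode concrete: a model that retrieves only the most recent weight has `len(in_window) == 1` worth of information and cannot produce a delta or a trend at all.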
Summary
Both models show substantial room for improvement on clinical workflows. Claude's 67% success rate leaves 33% of tasks unsolved. Gemini's 23% success rate indicates 77% of tasks remain beyond current capabilities. The gap between current performance and comprehensive task coverage represents the development frontier for computer-use AI in healthcare applications.