SWE-Bench changed how we evaluate coding agents. Instead of isolated puzzles, it tests whether models can actually resolve real GitHub issues: navigating codebases, understanding context, and producing working patches.
But there's a gap: most training data is still synthetic. We're releasing Gauntlet, a dataset of 79,139 real developer workflows from 1,669 production repositories, where every single example can be verified against real pytest tests. The kind of data that teaches models how software engineering actually works.
What's in Gauntlet
Gauntlet captures authentic development workflows from high-quality Python repositories using runtime tracing. Each example pairs a task description with the actual code changes a developer made. We only include examples that actually run and have an associated passing test.
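To make the verification requirement concrete, here is a minimal sketch of what checking one example could look like. The record fields (`repo`, `task`, `patch`, `test_path`) are hypothetical illustrations of the pairing described above, not Gauntlet's actual schema:

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical record layout; the real Gauntlet schema may differ.
example = {
    "repo": "https://github.com/example-org/example-repo",   # placeholder URL
    "task": "Fix off-by-one error in the pagination helper",
    "patch": "...unified diff of the developer's change...",  # placeholder diff
    "test_path": "tests/test_pagination.py::test_last_page",
}

def verify(example: dict) -> bool:
    """Clone the repo, apply the developer's patch, and run the paired pytest test."""
    with tempfile.TemporaryDirectory() as workdir:
        repo_dir = Path(workdir) / "repo"
        subprocess.run(
            ["git", "clone", "--depth", "1", example["repo"], str(repo_dir)],
            check=True,
        )
        # Apply the captured change from stdin.
        subprocess.run(
            ["git", "apply", "-"],
            input=example["patch"].encode(),
            cwd=repo_dir,
            check=True,
        )
        # An example only ships if its associated test passes.
        result = subprocess.run(
            ["python", "-m", "pytest", example["test_path"], "-q"],
            cwd=repo_dir,
        )
        return result.returncode == 0
```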
Gauntlet's heavy emphasis on bug fixes mirrors real development. Most engineering time is spent understanding and fixing existing code, which is exactly what SWE-Bench evaluates.
Training Results
We fine-tuned Devstral-Small 24B on Gauntlet using LoRA (rank 64) for 7h 40m on 8x A100 GPUs. The HumanEval results show dramatic improvements.
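Before getting to the numbers, here is roughly what the adapter setup looks like. Only the rank (64) is stated above; the base-model identifier, target modules, and every other hyperparameter below are illustrative assumptions, not our exact training script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed checkpoint name; substitute the actual Devstral-Small model you use.
BASE_MODEL = "mistralai/Devstral-Small-2505"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

lora_config = LoraConfig(
    r=64,                       # rank 64, as stated above
    lora_alpha=128,             # assumed: alpha = 2 * rank is a common default
    lora_dropout=0.05,          # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA updates only a small fraction of the 24B weights
```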
The Gauntlet-trained model achieves near-perfect HumanEval scores, with pass@1 jumping from 17.87% to 96.40% and pass@10 reaching 100%. It also generates 3-4x faster than the baseline, largely because it learns to stop generating once the solution is complete instead of running on.
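For readers unfamiliar with the metric: pass@k is typically computed with the unbiased estimator from the original HumanEval paper. You sample n completions per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples passes. A small sketch of that standard formula (not necessarily our exact evaluation harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled per problem
    c: completions that passed the unit tests
    k: sampling budget being evaluated (e.g. 1 or 10)
    """
    if n - c < k:
        return 1.0  # every k-subset is guaranteed to contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 19 pass.
print(pass_at_k(n=20, c=19, k=1))   # 0.95
print(pass_at_k(n=20, c=19, k=10))  # 1.0
```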
SWE-Bench Lite Results
We evaluated the Gauntlet-trained Devstral model on SWE-Bench Lite (300 instances). The results show meaningful improvements in real-world issue resolution.
Note: Devstral's published 46.8% score is on SWE-Bench Verified (500 manually screened instances). We ran on SWE-Bench Lite (300 instances, different subset), which explains the lower absolute numbers for both baseline and trained models.
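If you want to compare the two benchmarks yourself, both subsets are published on the Hugging Face Hub. The dataset identifiers below reflect the Princeton NLP releases at the time of writing and may change:

```python
from datasets import load_dataset

# SWE-Bench Lite: the 300-instance subset we evaluated on.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# SWE-Bench Verified: the 500 manually screened instances behind Devstral's published score.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(lite), len(verified))  # expected: 300 500
for instance in lite.select(range(2)):
    print(instance["instance_id"], instance["repo"])
```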
Summary
If you're training models for real-world coding tasks, you need training data that looks like real-world coding tasks. And you need to verify it actually works. Gauntlet provides both.
Interested in training on Gauntlet? Book a call.

