
Gauntlet

An SFT dataset of real developer workflows for training production-grade code models.

Erik Quintanilla
7 min read
An SFT dataset for teams building coding agents for real-world scenarios.

SWE-Bench changed how we evaluate coding agents. Instead of isolated puzzles, it tests whether models can actually resolve real GitHub issues: navigating codebases, understanding context, and producing working patches.

But there's a gap: most training data is still synthetic. We're releasing Gauntlet, a dataset of 79,139 real developer workflows from 1,669 production repositories, where every single example can be verified against real pytest tests. The kind of data that teaches models how software engineering actually works.

01

What's in Gauntlet

Gauntlet captures authentic development workflows from high-quality Python repositories using runtime tracing. Each example pairs a task description with the actual code changes a developer made. We only include examples that actually run and have an associated passing test.
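Concretely, a single record might look something like the sketch below. The field names are illustrative assumptions, not the published schema.

```python
# Sketch of one Gauntlet record. Field names are illustrative
# assumptions, not the published schema.
example = {
    "repo": "django/django",                    # source repository
    "base_commit": "a1b2c3d",                   # commit the workflow starts from
    "task": "Fix stale cache invalidation ...", # natural-language task description
    "patch": "diff --git a/core/cache.py ...",  # the developer's actual code change
    "test_command": "pytest tests/cache/",      # real test that must pass afterward
    "files_changed": ["core/cache.py", "core/signals.py"],  # cross-file context
}
```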

- 1,669 repositories: real open-source projects (Django, Flask, NumPy, etc.)
- 79,139 examples: pytest-verifiable developer workflows
- 37GB of source code: full repositories with git history
- 100% multi-file: every example involves cross-file context
Every Example is Verifiable
Unlike other code datasets, every Gauntlet example can be tested against real pytest tests from the original repository. This isn't "looks correct". It's verified correct. These tests also provide reward signals for RL.
Dependencies Pre-Resolved
Every repository comes with dependencies already resolved and available as Docker containers. No wrestling with version conflicts or broken environments. Just pull and run.
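As a sketch of what "pull and run" could look like in practice (the image name and record fields here are assumptions, not a published interface), verification reduces to running the repo's own tests inside the pre-built container. The boolean result doubles as a 0/1 reward signal for RL:

```python
import subprocess

def verify_example(image: str, test_command: str) -> bool:
    """Run an example's tests inside its pre-built container.

    `image` and `test_command` would come from the dataset record;
    both names are assumptions. Returns True iff the tests pass,
    which can serve directly as a binary RL reward.
    """
    # Pull the pre-resolved environment, then run the repo's real pytest suite.
    subprocess.run(["docker", "pull", image], check=True)
    result = subprocess.run(["docker", "run", "--rm", image, *test_command.split()])
    return result.returncode == 0

# e.g. verify_example("gauntlet/django-example:latest", "pytest tests/cache/")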
vs Other Datasets

| Dataset | Size | Verifiable | Real Code |
|---|---|---|---|
| HumanEval | 164 | ✓ (manual tests) | ✗ |
| MBPP | 974 | ✓ (manual tests) | ✗ |
| SWE-Bench | 2,294 | ✓ (pytest) | ✓ |
| CodeAlpaca | 20K | ✗ | ✗ |
| Magicoder | 75K | ✗ | Partial |
| Gauntlet | 79K | ✓ (pytest) | ✓ |
Task Distribution

| Task Type | Share | What It Teaches |
|---|---|---|
| Bug fixes | 85.5% | Debugging, root cause analysis, patch generation |
| Feature additions | 14.1% | Extending existing systems, API design |
| Refactoring | 0.4% | Code quality, architectural changes |

The heavy emphasis on bug fixes mirrors real development. Most engineering time is spent understanding and fixing existing code, which is exactly what SWE-Bench evaluates.

Token Distribution

| Metric | Value | What It Means |
|---|---|---|
| Median length | 7,329 tokens | Substantial context in every example |
| Max length | 1.2M tokens | Handles massive real-world codebases |
02

Training Results

We fine-tuned Devstral-Small 24B on Gauntlet using LoRA (rank 64) for 7h 40m on 8x A100 GPUs. The HumanEval results show dramatic improvements.
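For reference, here is a minimal sketch of that setup with Hugging Face peft. The base model and rank 64 match the run above; the target modules, alpha, and dropout values are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Base model from the run above (Hugging Face id for Devstral-Small 24B).
model = AutoModelForCausalLM.from_pretrained("mistralai/Devstral-Small-2505")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Devstral-Small-2505")

# Rank 64 matches the run described above; alpha, dropout, and the
# target modules are assumptions for illustration.
lora = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```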

- Base model: Devstral-Small 24B
- Training time: 7h 40m (8x A100, one epoch)
- Method: LoRA (rank 64, efficient fine-tuning)
- Context length: 12,288 tokens (extended sequence length)
HUMANEVAL RESULTS

[Chart: HumanEval performance (pass@k), baseline vs. Gauntlet-trained]

The Gauntlet-trained model achieves near-perfect HumanEval scores: pass@1 jumps from 17.87% to 96.40%, and pass@10 reaches 100%. It also generates 3-4x faster than the baseline because it learns when to stop.
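For reference, pass@k here is presumably the standard unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), given c passing samples out of n."""
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```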

03

SWE-Bench Lite Results

We evaluated the Gauntlet-trained Devstral model on SWE-Bench Lite (300 instances). The results show meaningful improvements in real-world issue resolution.
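The bookkeeping behind these numbers is simple. A hedged sketch, assuming the public SWE-bench Lite split on Hugging Face and a hypothetical set of resolved instance ids:

```python
from datasets import load_dataset

# SWE-bench Lite's test split contains the 300 instances used here.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# Hypothetical: ids of instances whose patches passed the official tests.
resolved_ids = set()

resolved_rate = len(resolved_ids) / len(lite)
print(f"resolved: {len(resolved_ids)}/{len(lite)} = {resolved_rate:.1%}")
```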

- +50% issues resolved (12 → 18 instances)
- +69% valid patches (87 → 147 patches)
| Metric | Baseline | Gauntlet-Trained | Change |
|---|---|---|---|
| Resolved | 4.0% (12/300) | 6.0% (18/300) | +50% |
| Valid patches | 87 | 147 | +69% |
| Empty patches | 71% | 51% | -28% |

Note: Devstral's published 46.8% score is on SWE-Bench Verified (500 manually screened instances). We ran on SWE-Bench Lite (300 instances, different subset), which explains the lower absolute numbers for both baseline and trained models.

04

Summary

Key Numbers
| Metric | Value |
|---|---|
| Repositories | 1,669 real open-source projects |
| Examples | 79,139 pytest-verifiable workflows |
| Source code | 37GB with full git history |
| Multi-file coverage | 100% (vs 2% for synthetic) |
| Training improvement | 3.6% loss reduction on held-out tasks |

If you're training models for real-world coding tasks, you need training data that looks like real-world coding tasks. And you need to verify it actually works. Gauntlet provides both.

Interested in training on Gauntlet? Book a call.