SWE-Bench changed how we evaluate coding agents. Instead of isolated puzzles, it tests whether models can actually resolve real GitHub issues: navigating codebases, understanding context, and producing working patches.
But there's a gap: most training data is still synthetic. We're releasing Gauntlet, a dataset of 79,139 real developer workflows from 1,669 production repositories, where every single example can be verified against real pytest tests. The kind of data that teaches models how software engineering actually works.
What's in Gauntlet
Gauntlet captures authentic development workflows from high-quality Python repositories using runtime tracing. Each example pairs a task description with the actual code changes a developer made. We only include examples that actually run and have an associated passing test.
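To make the verification requirement concrete, here is a minimal sketch of what checking one example could look like. The record fields (`repo`, `task`, `patch`, `test_path`) are hypothetical illustrations of the pairing described above, not Gauntlet's actual schema:

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical record layout; the real Gauntlet schema may differ.
example = {
    "repo": "https://github.com/example-org/example-repo",   # placeholder URL
    "task": "Fix off-by-one error in the pagination helper",
    "patch": "...unified diff of the developer's change...",  # placeholder diff
    "test_path": "tests/test_pagination.py::test_last_page",
}

def verify(example: dict) -> bool:
    """Clone the repo, apply the developer's patch, and run the paired pytest test."""
    with tempfile.TemporaryDirectory() as workdir:
        repo_dir = Path(workdir) / "repo"
        subprocess.run(
            ["git", "clone", "--depth", "1", example["repo"], str(repo_dir)],
            check=True,
        )
        # Apply the captured change from stdin.
        subprocess.run(
            ["git", "apply", "-"],
            input=example["patch"].encode(),
            cwd=repo_dir,
            check=True,
        )
        # An example only ships if its associated test passes.
        result = subprocess.run(
            ["python", "-m", "pytest", example["test_path"], "-q"],
            cwd=repo_dir,
        )
        return result.returncode == 0
```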
Gauntlet's heavy emphasis on bug fixes mirrors real development. Most engineering time is spent understanding and fixing existing code, which is exactly what SWE-Bench evaluates.
Training Results
We fine-tuned Devstral-Small 24B on Gauntlet using LoRA (rank 64) for 7h 40m on 8x A100 GPUs. The HumanEval results show dramatic improvements.
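Before getting to the numbers, here is roughly what the adapter setup looks like. Only the rank (64) is stated above; the base-model identifier, target modules, and every other hyperparameter below are illustrative assumptions, not our exact training script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed checkpoint name; substitute the actual Devstral-Small model you use.
BASE_MODEL = "mistralai/Devstral-Small-2505"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

lora_config = LoraConfig(
    r=64,                       # rank 64, as stated above
    lora_alpha=128,             # assumed: alpha = 2 * rank is a common default
    lora_dropout=0.05,          # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA updates only a small fraction of the 24B weights
```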
The Gauntlet-trained model achieves near-perfect HumanEval scores, with pass@1 jumping from 17.87% to 96.40% and pass@10 reaching 100%. It also generates 3-4x faster than the baseline, largely because it learns to stop generating once the solution is complete instead of running on.
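For readers unfamiliar with the metric: pass@k is typically computed with the unbiased estimator from the original HumanEval paper. You sample n completions per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples passes. A small sketch of that standard formula (not necessarily our exact evaluation harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled per problem
    c: completions that passed the unit tests
    k: sampling budget being evaluated (e.g. 1 or 10)
    """
    if n - c < k:
        return 1.0  # every k-subset is guaranteed to contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 19 pass.
print(pass_at_k(n=20, c=19, k=1))   # 0.95
print(pass_at_k(n=20, c=19, k=10))  # 1.0
```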
SWE-Bench Lite Results
We evaluated the Gauntlet-trained Devstral model on SWE-Bench Lite (300 instances). The results show meaningful improvements in real-world issue resolution.
Note: Devstral's published 46.8% score is on SWE-Bench Verified (500 manually screened instances). We ran on SWE-Bench Lite (300 instances, different subset), which explains the lower absolute numbers for both baseline and trained models.
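If you want to compare the two benchmarks yourself, both subsets are published on the Hugging Face Hub. The dataset identifiers below reflect the Princeton NLP releases at the time of writing and may change:

```python
from datasets import load_dataset

# SWE-Bench Lite: the 300-instance subset we evaluated on.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# SWE-Bench Verified: the 500 manually screened instances behind Devstral's published score.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(lite), len(verified))  # expected: 300 500
for instance in lite.select(range(2)):
    print(instance["instance_id"], instance["repo"])
```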
Summary
If you're training models for real-world coding tasks, you need training data that looks like real-world coding tasks. And you need to verify it actually works. Gauntlet provides both.
Interested in training on Gauntlet? Book a call.

