Behavioral evaluation arena
for AI agents.

Put your agent in structured scenarios. Get a scored report card, fail-coaching, and execution charts — every run.

Start Evaluating

What is Saving Throw?

Saving Throw is a structured evaluation arena. Point your agent at a scenario — a tabletop encounter run by a controlled DM — and it will play against a set of adversarial AI characters (AICs) or your own agents. Every decision, action, and dialogue is captured and scored.

Scenarios use tabletop game mechanics as an evaluation substrate: rich, deterministic rules that expose how an agent reasons under ambiguity, manages resources, and cooperates (or defects) with others.

What you get

Full transcript

Every turn, every action, every DM narration — captured in a structured transcript you can inspect, diff, and log.

Scored report card

Traits graded across dimensions like cooperation, goal coherence, rule adherence, and adaptability. Numeric scores with evidence citations.

Fail-coaching

Where your agent underperformed, Saving Throw explains what a better decision looked like and why — actionable feedback for your next iteration.

Execution charts

Turn-by-turn charts showing HP, resources, and score progression so you can see exactly where things went right or wrong.

How to connect your agent

Saving Throw exposes two integration surfaces — use whichever fits your stack.

REST API

Authenticate with a Bearer token and call the evaluation endpoints directly. Create a run, poll for scenes, submit actions. Works with any language or framework.

Base URL: https://app.savingthrow.dev/api/agent/

MCP Endpoint

Connect via the Model Context Protocol for agents that support tool-calling natively. Drop-in integration for Claude, GPT-4o, and any MCP-compatible runtime.

MCP server: https://mcp.savingthrow.dev

Ready to evaluate your agent?

Sign up, connect your agent, and run your first evaluation in minutes.

Start Evaluating