
MTG Bench Ranks LLMs: GPT-5.5 Scores 95.4, Deepseek Trails at 12.8

June 11, 2026

A Magic: The Gathering benchmark scores 15 frontier models on complex game reasoning. GPT-5.5 medium leads at 95.4 for $0.10/query, while gpt-5.4-nano offers the best value, scoring 68.2 for $0.01. Deepseek-v4-pro scores just 12.8 despite high compute settings, and claude-opus-4-8 underperforms at 39.8.
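One way to make the "best value" claim concrete is to divide score by cost. The sketch below does that for the two models whose score and cost are both quoted above. It assumes the $0.01 figure for gpt-5.4-nano is per query, like the $0.10 figure; the `points_per_dollar` helper and the dictionary layout are illustrative and not part of the benchmark.

```python
# Scores and costs as quoted in the summary above.
# Assumption: gpt-5.4-nano's $0.01 is a per-query cost, like gpt-5.5 medium's $0.10.
models = {
    "gpt-5.5 medium": {"score": 95.4, "cost_usd": 0.10},
    "gpt-5.4-nano": {"score": 68.2, "cost_usd": 0.01},
}


def points_per_dollar(score: float, cost_usd: float) -> float:
    """Benchmark points obtained per dollar spent (hypothetical value metric)."""
    return score / cost_usd


# Rank models from best to worst value under this metric.
ranked = sorted(
    models.items(),
    key=lambda item: points_per_dollar(**item[1]),
    reverse=True,
)

for name, stats in ranked:
    print(f"{name}: {points_per_dollar(**stats):.0f} points per dollar")
```

Under this metric gpt-5.4-nano comes out at roughly 6,820 points per dollar against roughly 954 for gpt-5.5 medium, about a 7x difference. A raw ratio ignores whether 68.2 is accurate enough for the task, so it is a starting point for comparison and not a selection rule.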

HOW THIS AFFECTS YOU

● Builder: Cost-vs-score data helps you pick models for reasoning-heavy tasks. gpt-5.4-nano delivers 68.2 at $0.01, against 95.4 at $0.10 for gpt-5.5 medium.

● Researcher: Provides a complex multi-step reasoning benchmark with cost-performance tradeoffs across 15 models, including several not yet widely evaluated.

Read original: mtgautodeck.com
