| Model | Final score | Trajectory (per try) | Total time | Total tokens | Notes | Files |
|---|---|---|---|---|---|---|
| claude-opus-4-7 full · claude__full | 9/9 | first-try 9/9 | — | — | Authored in Claude Code chat session, same FULL_PROMPT requirements as ollama runs. Timing/token counts not measured. | play · raw · plan · score · meta |
| claude-opus full · claude-cli__full | 9/9 | first-try 9/9 | 18.1s | 1.8k | Run via `claude -p`. Cache tokens reflect CLI default-system-prompt overhead, not the eval prompt itself. | play · raw · plan · score · meta |
| mistral-large-3:675b-cloud full · mistral-large-3_675b-cloud__full | 9/9 | first-try 9/9 | 20.6s | 1.3k | | play · raw · plan · score · meta |
| gpt-oss:120b-cloud full · gpt-oss_120b-cloud__full | 9/9 | first-try 9/9 | 29.4s | 2.1k | | play · raw · plan · score · meta |
| qwen3-coder:480b-cloud full · qwen3-coder_480b-cloud__full | 9/9 | first-try 9/9 | 2m08s | 1.4k | | play · raw · plan · score · meta |
| nemotron-3-super:cloud full · nemotron-3-super_cloud__full | 9/9 | base 6/9→r1 9/9 | 30.4s (sum of 2 calls) | 5.0k | | play · raw · plan · score · meta |
| qwen3-coder-next:cloud full · qwen3-coder-next_cloud__full | 9/9 | base 2/3→r1 9/9 | 44.4s (sum of 2 calls) | 3.3k | | play · raw · plan · score · meta |
| gemma4:31b-cloud full · gemma4_31b-cloud__full | 9/9 | base 8/9→r1 9/9 | 1m31s (sum of 2 calls) | 3.0k | | play · raw · plan · score · meta |
| kimi-k2.6:cloud full · kimi-k2.6_cloud__full | 9/9 | base 2/3→r1 9/9 | 4m33s (sum of 2 calls) | 17.6k | | play · raw · plan · score · meta |
| deepseek-v4-pro:cloud full · deepseek-v4-pro_cloud__full | 9/9 | base 2/3→r1 9/9 | 5m47s (sum of 2 calls) | 9.0k | | play · raw · plan · score · meta |
| gpt-oss:20b full · gpt-oss_20b__full | 9/9 | base 7/9→r1 7/9→r2 9/9 | 29m46s (sum of 3 calls) | 14.6k | | play · raw · plan · score · meta |
| glm-5:cloud full · glm-5_cloud__full | 9/9 | base 2/3→r1 2/3→r2 9/9 | 10m44s (sum of 3 calls) | 22.8k | | play · raw · plan · score · meta |
| google/gemma-3-27b-it full · google_gemma-3-27b-it__full | 7/9 | first-try 7/9 | 35.7s | 917 | failed: canvas_animates, responds_to_input | play · raw · plan · score · meta |
| qwen3-coder:30b full · qwen3-coder_30b__full | 0/2 | base 0/2→r1 0/2→r2 0/2→r3 0/2→r4 0/2→r5 0/2 | 1m27s (sum of 6 calls) | 1.5k | failed: snake_ts_exists | play · raw · plan · score · meta |