| Model | Final score | Trajectory (per try) | Total time | Total tokens | Notes | Files |
|---|---|---|---|---|---|---|
| google/gemini-2.5-flash full · google_gemini-2.5-flash__full | 13/13 | first-try 13/13 | 10.4s | 2.7k | | play · raw · plan · score · meta |
| x-ai/grok-4.3 full · x-ai_grok-4.3__full | 13/13 | first-try 13/13 | 23.7s | 2.6k | | play · raw · plan · score · meta |
| claude-opus full · claude-cli__full | 13/13 | first-try 13/13 | 26.9s | 3.3k | Run via `claude -p`. Cache tokens reflect CLI default-system-prompt overhead, not the eval prompt itself. | play · raw · plan · score · meta |
| mistral-large-3:675b-cloud full · mistral-large-3_675b-cloud__full | 13/13 | first-try 13/13 | 31.4s | 2.2k | | play · raw · plan · score · meta |
| gpt-oss:120b-cloud full · gpt-oss_120b-cloud__full | 13/13 | first-try 13/13 | 46.5s | 3.7k | | play · raw · plan · score · meta |
| openai/gpt-5-mini full · openai_gpt-5-mini__full | 13/13 | first-try 13/13 | 48.3s | 4.6k | | play · raw · plan · score · meta |
| google/gemma-4-26b-a4b-it full · google_gemma-4-26b-a4b-it__full | 13/13 | first-try 13/13 | 1m19s | 2.7k | | play · raw · plan · score · meta |
| google/gemma-4-31b-it full · google_gemma-4-31b-it__full | 13/13 | first-try 13/13 | 1m29s | 2.2k | | play · raw · plan · score · meta |
| kimi-k2.6:cloud full · kimi-k2.6_cloud__full | 13/13 | first-try 13/13 | 3m12s | 3.3k | | play · raw · plan · score · meta |
| deepseek-v4-pro:cloud full · deepseek-v4-pro_cloud__full | 13/13 | first-try 13/13 | 4m05s | 7.7k | | play · raw · plan · score · meta |
| openai/gpt-5-codex full · openai_gpt-5-codex__full | 13/13 | first-try 13/13 | 5m00s | 48.6k | | play · raw · plan · score · meta |
| gpt-oss:20b full · gpt-oss_20b__full | 13/13 | first-try 13/13 | 12m43s | 6.1k | | play · raw · plan · score · meta |
| anthropic/claude-haiku-4-5 full · anthropic_claude-haiku-4-5__full | 13/13 | base 6/7→r1 13/13 | 25.5s (sum of 2 calls) | 5.8k | | play · raw · plan · score · meta |
| x-ai/grok-code-fast-1 full · x-ai_grok-code-fast-1__full | 13/13 | base 6/7→r1 13/13 | 46.0s (sum of 2 calls) | 6.6k | | play · raw · plan · score · meta |
| gemma4:31b-cloud full · gemma4_31b-cloud__full | 13/13 | base 6/7→r1 13/13 | 40.3s (sum of 2 calls) | 4.7k | | play · raw · plan · score · meta |
| meta-llama/llama-4-scout full · meta-llama_llama-4-scout__full | 13/13 | base 6/7→r1 13/13 | 36.7s (sum of 2 calls) | 4.0k | | play · raw · plan · score · meta |
| glm-5:cloud full · glm-5_cloud__full | 13/13 | base 0/2→r1 13/13 | 4m22s (sum of 2 calls) | 32.4k | | play · raw · plan · score · meta |
| qwen3-coder:480b-cloud full · qwen3-coder_480b-cloud__full | 13/13 | base 6/7→r1 13/13 | 7m54s (sum of 2 calls) | 4.2k | | play · raw · plan · score · meta |
| nemotron-3-super:cloud full · nemotron-3-super_cloud__full | 13/13 | base 6/7→r1 6/7→r2 13/13 | 2m30s (sum of 3 calls) | 13.4k | | play · raw · plan · score · meta |
| qwen3:8b full · qwen3_8b__full | 13/13 | base 6/7→r1 6/7→r2 11/13→r3 13/13 | 15m49s (sum of 4 calls) | 27.4k | | play · raw · plan · score · meta |
| google/gemma-3-27b-it full · google_gemma-3-27b-it__full | 6/7 | base 6/7→r1 6/7→r2 6/7→r3 6/7 | 4m47s (sum of 4 calls) | 9.1k | failed: ts_compiles | play · raw · plan · score · meta |
| gemma4:e2b-it-q8_0 full · gemma4_e2b-it-q8_0__full | 6/7 | base 6/7→r1 6/7→r2 6/7→r3 6/7 | 10m17s (sum of 4 calls) | 18.6k | failed: ts_compiles | play · raw · plan · score · meta |
| qwen3-coder-next:cloud full · qwen3-coder-next_cloud__full | 6/7 | base 6/7→r1 6/7→r2 6/7→r3 6/7 | 4m55s (sum of 4 calls) | 8.2k | failed: ts_compiles | play · raw · plan · score · meta |
| qwen2.5-coder:7b full · qwen2.5-coder_7b__full | 11/13 | base 11/13→r1 11/13→r2 11/13→r3 11/13 | 2m38s (sum of 4 calls) | 5.8k | failed: space_starts_game, arrow_steers | play · raw · plan · score · meta |
| gemma4:e2b full · gemma4_e2b__full | 11/13 | base 6/7→r1 6/7→r2 6/7→r3 11/13 | 4m18s (sum of 4 calls) | 15.1k | failed: space_starts_game, arrow_steers | play · raw · plan · score · meta |
| ? ? · google_gemma-4-e2b__full | 0/0 | first-try 0/0 | — | — | | play · raw · plan · score · meta |
| google/gemma-4-26b-a4b full · google_gemma-4-26b-a4b__full | 0/2 | base 0/2→r1 0/2→r2 0/2→r3 0/2 | 32m50s (sum of 4 calls) | — | failed: index_html_exists, snake_ts_exists | play · raw · plan · score · meta |
| gemma4:26b full · gemma4_26b__full | 0/2 | base 0/2→r1 0/2→r2 0/2→r3 0/2 | 122m03s (sum of 4 calls) | 65.5k | failed: index_html_exists, snake_ts_exists | play · raw · plan · score · meta |
| google/gemma-4-31b full · google_gemma-4-31b__full | 0/2 | base 0/2→r1 0/2→r2 0/2→r3 0/2 | 271m29s (sum of 4 calls) | — | failed: snake_ts_exists | play · raw · plan · score · meta |