Snake LLM Bench — Round 2 — strict spec

29 model groups · sorted by final score, then retries used, then base-call speed
Model · Final score · Trajectory (per try) · Total time · Total tokens · Notes · Files
google/gemini-2.5-flash
full · google_gemini-2.5-flash__full
13/13 first-try 13/13 10.4s 2.7k play · raw · plan · score · meta
x-ai/grok-4.3
full · x-ai_grok-4.3__full
13/13 first-try 13/13 23.7s 2.6k play · raw · plan · score · meta
claude-opus
full · claude-cli__full
13/13 first-try 13/13 26.9s 3.3k Run via `claude -p`. Cache tokens reflect CLI default-system-prompt overhead, not the eval prompt itself. play · raw · plan · score · meta
mistral-large-3:675b-cloud
full · mistral-large-3_675b-cloud__full
13/13 first-try 13/13 31.4s 2.2k play · raw · plan · score · meta
gpt-oss:120b-cloud
full · gpt-oss_120b-cloud__full
13/13 first-try 13/13 46.5s 3.7k play · raw · plan · score · meta
openai/gpt-5-mini
full · openai_gpt-5-mini__full
13/13 first-try 13/13 48.3s 4.6k play · raw · plan · score · meta
google/gemma-4-26b-a4b-it
full · google_gemma-4-26b-a4b-it__full
13/13 first-try 13/13 1m19s 2.7k play · raw · plan · score · meta
google/gemma-4-31b-it
full · google_gemma-4-31b-it__full
13/13 first-try 13/13 1m29s 2.2k play · raw · plan · score · meta
kimi-k2.6:cloud
full · kimi-k2.6_cloud__full
13/13 first-try 13/13 3m12s 3.3k play · raw · plan · score · meta
deepseek-v4-pro:cloud
full · deepseek-v4-pro_cloud__full
13/13 first-try 13/13 4m05s 7.7k play · raw · plan · score · meta
openai/gpt-5-codex
full · openai_gpt-5-codex__full
13/13 first-try 13/13 5m00s 48.6k play · raw · plan · score · meta
gpt-oss:20b
full · gpt-oss_20b__full
13/13 first-try 13/13 12m43s 6.1k play · raw · plan · score · meta
anthropic/claude-haiku-4-5
full · anthropic_claude-haiku-4-5__full
13/13 base 6/7 → r1 13/13 25.5s
sum of 2 calls
5.8k play · raw · plan · score · meta
x-ai/grok-code-fast-1
full · x-ai_grok-code-fast-1__full
13/13 base 6/7 → r1 13/13 46.0s
sum of 2 calls
6.6k play · raw · plan · score · meta
gemma4:31b-cloud
full · gemma4_31b-cloud__full
13/13 base 6/7 → r1 13/13 40.3s
sum of 2 calls
4.7k play · raw · plan · score · meta
meta-llama/llama-4-scout
full · meta-llama_llama-4-scout__full
13/13 base 6/7 → r1 13/13 36.7s
sum of 2 calls
4.0k play · raw · plan · score · meta
glm-5:cloud
full · glm-5_cloud__full
13/13 base 0/2 → r1 13/13 4m22s
sum of 2 calls
32.4k play · raw · plan · score · meta
qwen3-coder:480b-cloud
full · qwen3-coder_480b-cloud__full
13/13 base 6/7 → r1 13/13 7m54s
sum of 2 calls
4.2k play · raw · plan · score · meta
nemotron-3-super:cloud
full · nemotron-3-super_cloud__full
13/13 base 6/7 → r1 6/7 → r2 13/13 2m30s
sum of 3 calls
13.4k play · raw · plan · score · meta
qwen3:8b
full · qwen3_8b__full
13/13 base 6/7 → r1 6/7 → r2 11/13 → r3 13/13 15m49s
sum of 4 calls
27.4k play · raw · plan · score · meta
google/gemma-3-27b-it
full · google_gemma-3-27b-it__full
6/7 base 6/7 → r1 6/7 → r2 6/7 → r3 6/7 4m47s
sum of 4 calls
9.1k failed: ts_compiles play · raw · plan · score · meta
gemma4:e2b-it-q8_0
full · gemma4_e2b-it-q8_0__full
6/7 base 6/7 → r1 6/7 → r2 6/7 → r3 6/7 10m17s
sum of 4 calls
18.6k failed: ts_compiles play · raw · plan · score · meta
qwen3-coder-next:cloud
full · qwen3-coder-next_cloud__full
6/7 base 6/7 → r1 6/7 → r2 6/7 → r3 6/7 4m55s
sum of 4 calls
8.2k failed: ts_compiles play · raw · plan · score · meta
qwen2.5-coder:7b
full · qwen2.5-coder_7b__full
11/13 base 11/13 → r1 11/13 → r2 11/13 → r3 11/13 2m38s
sum of 4 calls
5.8k failed: space_starts_game, arrow_steers play · raw · plan · score · meta
gemma4:e2b
full · gemma4_e2b__full
11/13 base 6/7 → r1 6/7 → r2 6/7 → r3 11/13 4m18s
sum of 4 calls
15.1k failed: space_starts_game, arrow_steers play · raw · plan · score · meta
?
? · google_gemma-4-e2b__full
0/0 first-try 0/0 play · raw · plan · score · meta
google/gemma-4-26b-a4b
full · google_gemma-4-26b-a4b__full
0/2 base 0/2 → r1 0/2 → r2 0/2 → r3 0/2 32m50s
sum of 4 calls
failed: index_html_exists, snake_ts_exists play · raw · plan · score · meta
gemma4:26b
full · gemma4_26b__full
0/2 base 0/2 → r1 0/2 → r2 0/2 → r3 0/2 122m03s
sum of 4 calls
65.5k failed: index_html_exists, snake_ts_exists play · raw · plan · score · meta
google/gemma-4-31b
full · google_gemma-4-31b__full
0/2 base 0/2 → r1 0/2 → r2 0/2 → r3 0/2 271m29s
sum of 4 calls
failed: snake_ts_exists play · raw · plan · score · meta
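
The ordering stated at the top of the table (final score, then retries used, then base-call speed) can be sketched as a sort key. This is only an illustration, not the benchmark's actual code. It assumes that "final score" is compared as a pass rate rather than a raw count, which is inferred from the 6/7 rows ranking above the 11/13 rows, and that a 0/0 row counts as a rate of zero. The `Row` type, its field names, the model names, and the `base_seconds` values are hypothetical placeholders, since the table reports only total time for rows with retries.

```python
from fractions import Fraction
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row (hypothetical shape, not the benchmark's schema)."""
    model: str
    passed: int          # checks passed on the final attempt
    total: int           # checks evaluated on the final attempt
    retries: int         # 0 for first-try, 1 for r1, 2 for r2, ...
    base_seconds: float  # duration of the base call only (placeholder values)


def pass_rate(row: Row) -> Fraction:
    # Assumption: a 0/0 row is treated as a pass rate of zero.
    return Fraction(row.passed, row.total) if row.total else Fraction(0)


def sort_key(row: Row):
    # Higher pass rate first, then fewer retries, then the faster base call.
    return (-pass_rate(row), row.retries, row.base_seconds)


# Made-up rows whose score shapes mirror the ones that appear in the table.
rows = [
    Row("model-d", 11, 13, 3, 30.0),
    Row("model-c", 6, 7, 3, 60.0),
    Row("model-b", 13, 13, 1, 12.0),
    Row("model-a", 13, 13, 0, 10.0),
]

ranked = sorted(rows, key=sort_key)
for row in ranked:
    print(f"{row.model}: {row.passed}/{row.total}, retries={row.retries}")
```

Exact fractions avoid floating-point ties when comparing rates such as 6/7 and 11/13, which differ by only about 0.011.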