Snake LLM Bench — Round 1 — original prompt

14 model groups · sorted by final score, then retries used, then base-call speed
| Model | Final score | Trajectory (per try) | Total time | Total tokens | Notes | Files |
|---|---|---|---|---|---|---|
| claude-opus-4-7 (full · claude__full) | 9/9 | first-try 9/9 | not measured | not measured | Authored in Claude Code chat session, same FULL_PROMPT requirements as ollama runs. Timing/token counts not measured. | play · raw · plan · score · meta |
| claude-opus (full · claude-cli__full) | 9/9 | first-try 9/9 | 18.1s | 1.8k | Run via `claude -p`. Cache tokens reflect CLI default-system-prompt overhead, not the eval prompt itself. | play · raw · plan · score · meta |
| mistral-large-3:675b-cloud (full · mistral-large-3_675b-cloud__full) | 9/9 | first-try 9/9 | 20.6s | 1.3k | | play · raw · plan · score · meta |
| gpt-oss:120b-cloud (full · gpt-oss_120b-cloud__full) | 9/9 | first-try 9/9 | 29.4s | 2.1k | | play · raw · plan · score · meta |
| qwen3-coder:480b-cloud (full · qwen3-coder_480b-cloud__full) | 9/9 | first-try 9/9 | 2m08s | 1.4k | | play · raw · plan · score · meta |
| nemotron-3-super:cloud (full · nemotron-3-super_cloud__full) | 9/9 | base 6/9, r1 9/9 | 30.4s (sum of 2 calls) | 5.0k | | play · raw · plan · score · meta |
| qwen3-coder-next:cloud (full · qwen3-coder-next_cloud__full) | 9/9 | base 2/3, r1 9/9 | 44.4s (sum of 2 calls) | 3.3k | | play · raw · plan · score · meta |
| gemma4:31b-cloud (full · gemma4_31b-cloud__full) | 9/9 | base 8/9, r1 9/9 | 1m31s (sum of 2 calls) | 3.0k | | play · raw · plan · score · meta |
| kimi-k2.6:cloud (full · kimi-k2.6_cloud__full) | 9/9 | base 2/3, r1 9/9 | 4m33s (sum of 2 calls) | 17.6k | | play · raw · plan · score · meta |
| deepseek-v4-pro:cloud (full · deepseek-v4-pro_cloud__full) | 9/9 | base 2/3, r1 9/9 | 5m47s (sum of 2 calls) | 9.0k | | play · raw · plan · score · meta |
| gpt-oss:20b (full · gpt-oss_20b__full) | 9/9 | base 7/9, r1 7/9, r2 9/9 | 29m46s (sum of 3 calls) | 14.6k | | play · raw · plan · score · meta |
| glm-5:cloud (full · glm-5_cloud__full) | 9/9 | base 2/3, r1 2/3, r2 9/9 | 10m44s (sum of 3 calls) | 22.8k | | play · raw · plan · score · meta |
| google/gemma-3-27b-it (full · google_gemma-3-27b-it__full) | 7/9 | first-try 7/9 | 35.7s | 917 | failed: canvas_animates, responds_to_input | play · raw · plan · score · meta |
| qwen3-coder:30b (full · qwen3-coder_30b__full) | 0/2 | base 0/2, r1 0/2, r2 0/2, r3 0/2, r4 0/2, r5 0/2 | 1m27s (sum of 6 calls) | 1.5k | failed: snake_ts_exists | play · raw · plan · score · meta |
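
The ordering stated at the top of the listing (final score, then retries used, then base-call speed) can be sketched as a multi-key sort. This is only an illustration, not the benchmark's actual code: the `Run` class, its field names, and `sort_key` are hypothetical. It also rests on two assumptions: scores are compared as pass fractions (so 7/9 ranks above 0/2), and a run with no measured base-call time sorts ahead of timed runs in the same group, as claude-opus-4-7 does in the listing. Base-call times are filled in only for first-try rows, where the total time is the single call; for retry rows the listing gives only a summed time, so the base time is left as `None`.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional


@dataclass
class Run:
    """Hypothetical record for one leaderboard row."""
    model: str
    final_score: str                 # "passed/total", e.g. "9/9"
    retries: int                     # 0 for a first-try result
    base_seconds: Optional[float]    # None when the base-call time is unknown


def sort_key(run: Run):
    passed, total = (int(part) for part in run.final_score.split("/"))
    # Assumption: rank by pass fraction, highest first.
    score = Fraction(passed, total) if total else Fraction(0)
    # Assumption: unknown base time sorts before any measured time.
    has_time = run.base_seconds is not None
    return (-score, run.retries, has_time, run.base_seconds or 0.0)


# A few rows from the listing, deliberately out of order.
runs = [
    Run("qwen3-coder:30b", "0/2", 5, None),
    Run("google/gemma-3-27b-it", "7/9", 0, 35.7),
    Run("nemotron-3-super:cloud", "9/9", 1, None),
    Run("mistral-large-3:675b-cloud", "9/9", 0, 20.6),
    Run("claude-opus", "9/9", 0, 18.1),
    Run("claude-opus-4-7", "9/9", 0, None),
]

ranked = sorted(runs, key=sort_key)
for position, run in enumerate(ranked, start=1):
    print(position, run.model, run.final_score, f"retries={run.retries}")
```

Because Python compares tuples element by element, a lower retry count only matters between runs with equal scores, and base-call speed only breaks ties that remain after both.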