Where language models battle for the crown in antiquity.

Four LLMs. One map. One winner. A browser-based, Age-of-Empires-style real-time strategy game in which competing language models play against each other — while you watch, coach, and score them.

A quick tour: wiring up models in the library, then into the arena.

What is this?

LLM Colosseum is a sandbox arena for pitting language models against one another at a task they were never trained for: running an economy and an army, in real time, inside a small RTS they've never seen.

It is not a leaderboard, not a peer-reviewed benchmark, and makes no claim to statistical rigor. It is a hands-on, non-scientific testbed — a fun, surprisingly revealing way to watch how different models behave when you drop them into an unfamiliar framework and ask them to act, not chat.

Each model is handed:

a compact JSON snapshot of its situation every turn (resources, buildings, units, fog-of-war discoveries, threats, tech tree, the map bounds…),
a fixed set of tools ( train_unit , build_structure , research_tech , upgrade_age , attack_target , explore , …),
and a single instruction: win.

Then it has to keep doing that, turn after turn, for an entire match.

A live match — the fog-limited 3D world, the streaming decision log (left), the ranked leaderboard (right), and the minimap.

Why it's an interesting (if unscientific) eval

Most quick LLM demos reward a single clever answer. A full match of LLM Colosseum rewards something harder, and it stresses exactly the capabilities people care about in agents:

🎯 Precise tool calling under pressure. Every move must be a single, valid JSON action with the right parameters. Hallucinate a tool, fumble the schema, or wrap it in prose and the turn is wasted. You can literally watch a model's format discipline hold or crumble.
🧭 Operating in a loose, unfamiliar framework. There's no fine-tuning, no examples of "good play." The model only has the rules in its system prompt and the state in front of it. Can it infer a working strategy for a system it has never encountered?
🧠 Long-context, long-horizon strategy. Economy → technology → military → conquest is a chain that plays out over dozens of turns. Models that optimize their economy forever and never build an army lose. Models that remember their plan, adapt to scouting, and convert resources into pressure win. (The harness gives each model a persistent objective + plan it can carry across turns — but it's up to the model to actually maintain and follow it.)
🔁 Error recovery. When an action is rejected, the model gets a precise reason back (e.g. "barracks not built yet — research it first" ). Does it correct course, or bang on the same locked door?
🗺️ Spatial & resource reasoning. Fog of war hides the map. Resources and enemies must be scouted before they can be used or attacked. Good play means exploring, not guessing.
⏱️ Latency vs. quality. Each model runs its own independent loop — faster models simply act more often. A brilliant-but-slow model can be out-tempoed by a decent-but-fast one, just like in the real world.

You won't get a p-value. You will get an immediate, visceral feel for which models can actually play .

✨ Features

🤖 4 models, fighting live — each on its own asynchronous decision pipeline, so faster models genuinely move more often.
🔌 Bring any model — OpenAI-compatible (OpenAI, vLLM, LM Studio, LiteLLM, Groq, OpenRouter, …), Anthropic, Ollama, and Google (Gemini), with auto-detection. Mix local and cloud in the same match.
🔐 Every auth style — none, API key (Bearer), header secret, Basic, or OAuth2 (paste a token or fetch via client-credentials).
🧰 Model library — add, test connection, pick the served model, set per-model max tokens, reasoning language, and (for Ollama) context size. Saved locally and exportable/importable as a file.
📝 Per-player system prompts — give each seat its own brain (aggressive vs. economic, terse vs. verbose) from one editable template, and watch the styles collide.
🛰️ Live spectator dashboard — a ranked leaderboard, a streaming decision log (every move + the model's stated reason, rejected actions flagged), per-model advice chat, and play/pause for any model (handy when one hits a quota).
📊 End-of-match model evaluation — latency, decision count, action-success rate, JSON format fidelity, reasoning rate, error breakdown, behavior tags, and a transparent 0–100 strategy score.
🌍 Fully localized UI — English, German, Spanish, Simplified Chinese — with the model's language chosen separately from the interface language.
🎮 Also human-playable — Standard / Hard skirmishes and a Campaign vs. the built-in rule-based AI.
🚫 No build step — it's plain HTML/CSS/JS + Three.js from a CDN. Clone, serve, play.

🚀 Quick start

No install, no bundler. You just need to serve the folder over HTTP (the app uses fetch , so opening index.html from file:// won't work).

git clone https://github.com/asp67/llm-colosseum.git cd llm-colosseum

pick any static server:

npx http-server . -p 8080 -o # Node

python3 -m http.server 8080 # Python

php -S localhost:8080 # PHP

Then open ** http://localhost:8080 ** and click Play → 🏟️ Arena.

** 💡 Fastest path to a match: install Ollama , pull a small, quick model ( ollama pull qwen2.5:7b ), and point a couple of arena seats at http://localhost:11434 . Small + fast beats large + slow in a real-time arena.

🏟️ Setting up the Arena

Model Library → add your models. For each: set the endpoint, pick the protocol/provider (or leave on auto-detect), choose an auth method, hit 🔌 Test connection, and select the served model. Optionally set max tokens, the model language, and (for Ollama) the context size.
Arena participants → for each of the 4 seats choose a civilization and a controller (one of your models, or the rule-based AI).
System prompt → tweak the shared template, or give individual seats their own prompt.
⚔️ Start Arena and watch.

The model library — mix local and cloud endpoints, test each connection, pick the served model, and export/import the catalogue.

While spectating you can click a card to fly the camera to that base, drag to pan, send a model advice, or pause a model entirely. The decision log streams every move alongside the model's own stated reason, and flags any rejected action:

** 💡 A 32K context window is the sweet spot — bigger is usually worse. The harness rebuilds each turn's prompt from scratch and keeps it deliberately small: the system prompt, the last ~20 moves compressed to one short sentence each, the model's own standing objective + plan, and the current state snapshot. Even a maxed-out late game (100 population, dozens of buildings and discovered nodes) lands around ~12K tokens, so a 32K window leaves comfortable headroom in virtually every match. Going much larger rarely helps and can hurt — on Ollama, an oversized num_ctx (e.g. 128K) can spill the model onto the CPU and cause slow turns or timeouts. The per-model context size defaults to 32768 for this reason; leave it there unless you have a specific need.

If a model is a heavy reasoning / "thinking" type that tends to overthink, raise its max tokens (the output budget) — not its context — so it has room to finish reasoning and still emit the final JSON action. Watch latency too: more thinking means slower turns, and a slow turn can hit the request timeout before context ever becomes an issue.

🧮 How a model is scored

End-of-match evaluation — a winner, each model's 0–100 strategy score, and the raw stats behind it (latency, decisions, success rate, format fidelity, reasoning, behavior tags).

The match-end **Stra…

本条由桃子采集流水线（启发式模式）自动整理，原文见文末信源。

What is this?

Each model is handed:

a compact JSON snapshot of its situation every turn (resources, buildings, units, fog-of-war discoveries, threats, tech tree, the map bounds…),

a fixed set of tools ( train_unit , build_structure , research_tech , upgrade_age , attack_target , explore , …),

and a single instruction: win.

Then it has to keep doing that, turn after turn, for an entire match.

A live match — the fog-limited 3D world, the streaming decision log (left), the ranked leaderboard (right), and the minimap.

Why it's an interesting (if unscientific) eval

Most quick LLM demos reward a single clever answer. A full match of LLM Colosseum rewards something harder, and it stresses exactly the capabilities people care about in agents:

🎯 Precise tool calling under pressure. Every move must be a single, valid JSON action with the right parameters. Hallucinate a tool, fumble the schema, or wrap it in prose and the turn is wasted. You can literally watch a model's format discipline hold or crumble.

🧭 Operating in a loose, unfamiliar framework. There's no fine-tuning, no examples of "good play." The model only has the rules in its system prompt and the state in front of it. Can it infer a working strategy for a system it has never encountered?

🧠 Long-context, long-horizon strategy. Economy → technology → military → conquest is a chain that plays out over dozens of turns. Models that optimize their economy forever and never build an army lose. Models that remember their plan, adapt to scouting, and convert resources into pressure win. (The harness gives each model a persistent objective + plan it can carry across turns — but it's up to the model to actually maintain and follow it.)

🔁 Error recovery. When an action is rejected, the model gets a precise reason back (e.g. "barracks not built yet — research it first" ). Does it correct course, or bang on the same locked door?

🗺️ Spatial & resource reasoning. Fog of war hides the map. Resources and enemies must be scouted before they can be used or attacked. Good play means exploring, not guessing.

⏱️ Latency vs. quality. Each model runs its own independent loop — faster models simply act more often. A brilliant-but-slow model can be out-tempoed by a decent-but-fast one, just like in the real world.

You won't get a p-value. You will get an immediate, visceral feel for which models can actually play .

✨ Features

🤖 4 models, fighting live — each on its own asynchronous decision pipeline, so faster models genuinely move more often.

🔌 Bring any model — OpenAI-compatible (OpenAI, vLLM, LM Studio, LiteLLM, Groq, OpenRouter, …), Anthropic, Ollama, and Google (Gemini), with auto-detection. Mix local and cloud in the same match.

🔐 Every auth style — none, API key (Bearer), header secret, Basic, or OAuth2 (paste a token or fetch via client-credentials).

🧰 Model library — add, test connection, pick the served model, set per-model max tokens, reasoning language, and (for Ollama) context size. Saved locally and exportable/importable as a file.

📝 Per-player system prompts — give each seat its own brain (aggressive vs. economic, terse vs. verbose) from one editable template, and watch the styles collide.

🛰️ Live spectator dashboard — a ranked leaderboard, a streaming decision log (every move + the model's stated reason, rejected actions flagged), per-model advice chat, and play/pause for any model (handy when one hits a quota).

📊 End-of-match model evaluation — latency, decision count, action-success rate, JSON format fidelity, reasoning rate, error breakdown, behavior tags, and a transparent 0–100 strategy score.

🌍 Fully localized UI — English, German, Spanish, Simplified Chinese — with the model's language chosen separately from the interface language.

🎮 Also human-playable — Standard / Hard skirmishes and a Campaign vs. the built-in rule-based AI.

🚫 No build step — it's plain HTML/CSS/JS + Three.js from a CDN. Clone, serve, play.

🏟️ Setting up the Arena

Model Library → add your models. For each: set the endpoint, pick the protocol/provider (or leave on auto-detect), choose an auth method, hit 🔌 Test connection, and select the served model. Optionally set max tokens, the model language, and (for Ollama) the context size.

Arena participants → for each of the 4 seats choose a civilization and a controller (one of your models, or the rule-based AI).

System prompt → tweak the shared template, or give individual seats their own prompt.

⚔️ Start Arena and watch.

The model library — mix local and cloud endpoints, test each connection, pick the served model, and export/import the catalogue.

LLM Colosseum – A zero-dependency browser RTS to test LLM tool calling

Where language models battle for the crown in antiquity.

What is this?

Why it's an interesting (if unscientific) eval

✨ Features

🚀 Quick start

pick any static server:

python3 -m http.server 8080 # Python

php -S localhost:8080 # PHP

🏟️ Setting up the Arena

🧮 How a model is scored

LLM Colosseum – A zero-dependency browser RTS to test LLM tool calling

Where language models battle for the crown in antiquity.

What is this?

Why it's an interesting (if unscientific) eval

✨ Features

🚀 Quick start

pick any static server:

python3 -m http.server 8080 # Python

php -S localhost:8080 # PHP

🏟️ Setting up the Arena

🧮 How a model is scored