This short paper introduces an LLM leaderboard based on the simulation/automation game Factorio. The authors created a programmatic interface to the game and then asked several LLMs to play a simplified version of the game through that interface.
The LLMs were evaluated in two settings: lab play, a set of structured tasks with fixed resources, and open play, an unbounded task where the agent tries to build the largest factory it can.
The paper shows a wide spread in scores across the models tested, with Claude 3.5 Sonnet beating GPT-4o, Deepseek-v3, Gemini-2, Llama-3.3-70B, and GPT-4o-Mini in both open play and lab play.
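To make the "programmatic interface" concrete, the loop below sketches how an LLM agent might play through such an API. This is a toy stand-in, not the paper's actual interface: the environment class, method names (`place_entity`, `score`), and the hard-coded policy are all illustrative assumptions.

```python
# Hypothetical sketch of an agent-environment loop for a Factorio-like
# programmatic interface. All names here are illustrative, not the
# paper's actual API.

class ToyFactorioEnv:
    """Toy stand-in for the game's programmatic interface."""

    def __init__(self):
        self.entities = []  # (name, x, y) tuples placed so far

    def place_entity(self, name, x, y):
        # In the real game this would validate placement, consume
        # items from the inventory, etc.
        self.entities.append((name, x, y))
        return f"placed {name} at ({x}, {y})"

    def score(self):
        # Placeholder metric; a real leaderboard would score
        # production throughput or task completion instead.
        return len(self.entities)


def agent_policy(observation, step):
    # An LLM call would go here: prompt the model with the current
    # observation and parse its reply into an action. We hard-code
    # a trivial policy for the sketch.
    return ("burner-mining-drill", step, 0)


env = ToyFactorioEnv()
for step in range(3):
    action = agent_policy(env.entities, step)
    feedback = env.place_entity(*action)
print(env.score())  # 3
```

The key design point such an interface enables is that the model's output is executable code or structured actions, so its behavior can be scored automatically without a human in the loop.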