Teaching AI agents to ask better questions by playing “Battleship”
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory and Harvard University's School of Engineering and Applied Sciences have completed a systematic investigation into how large language models formulate questions and gather information, using the childhood board game Battleship as their experimental framework. Conducted during 2025 and extending into early 2026, this project represents a meaningful shift in how the academic community evaluates artificial intelligence systems designed to operate as autonomous agents. The core finding challenges the prevailing assumption that bigger models inherently perform better: a relatively compact system called Llama 4 Scout, operating at approximately one percent of the computational cost of frontier models like GPT-5, ultimately demonstrated superior performance in strategic questioning and information-seeking tasks. This outcome carries particular weight given the mounting pressure on organizations to justify the substantial capital expenditure required to develop and deploy increasingly massive language models. The research directly addresses a critical gap in how contemporary AI systems approach problem-solving in domains where asking the right questions matters as much as processing the correct answers.
The significance of this research emerges from a fundamental limitation in how current large language models approach uncertain environments requiring iterative information gathering. While contemporary AI systems excel at pattern matching and text generation across domains where the solution space is relatively well-defined, they struggle considerably when deployed in high-stakes scenarios like medical diagnosis or scientific discovery, where systematic inquiry and hypothesis testing prove essential. The Battleship research builds directly on decades of cognitive science literature examining how humans learn through strategic questioning, transforming a simple parlor game into a rigorous testbed for evaluating machine reasoning. The motivation underlying this investigation reflects broader industry anxieties about whether larger models genuinely develop superior reasoning capabilities or merely accumulate greater parameter counts without meaningfully advancing practical problem-solving abilities. In an era where training costs for frontier models have escalated into the hundreds of millions of dollars, demonstrating that smaller systems can outperform expensive alternatives at information-seeking tasks carries considerable implications for resource allocation in artificial intelligence research and development.
The experimental methodology drew on more than 40 human participants who played a modified version of Battleship designed around natural language interaction, with one player acting as captain and asking strategic questions while another served as spotter, providing real-time responses. The researchers constructed the BattleshipQA dataset from this human gameplay, producing a benchmark for evaluating model performance on both question formulation and answer accuracy. When tested without any specialized training or fine-tuning, state-of-the-art models including GPT-5 could complete individual games in fewer total turns than human players. That headline result masked a wide gap between models, however: smaller models like Llama 4 Scout proved substantially less adept at generating questions that efficiently narrowed the solution space. Using standard inference methods, Llama 4 Scout achieved only an 8 percent win rate against human players, starkly underperforming frontier systems and revealing the disadvantage smaller models faced when required to reason strategically about information value.
The decisive improvement came from a Monte Carlo inference strategy that changed how models evaluated and selected questions. Rather than letting a system choose its next inquiry based on surface-level probability distributions, the refinement required the model to explicitly estimate the likelihood of each potential outcome given the information accumulated from previous responses. When this approach was applied to Llama 4 Scout, the model's win rate against human players surged from 8 percent to 82 percent, a roughly tenfold improvement that reframes the practical calculus of model selection and deployment. Equally remarkable, this enhanced questioning capability enabled the smaller model to reliably outperform GPT-5, the frontier system, while consuming roughly one-hundredth of the computational resources required by the larger alternative. The researchers also improved answer accuracy across all tested models by implementing a code verification mechanism that converted natural language questions into executable logical statements, so that models could explicitly validate their responses against known information rather than relying on implicit pattern matching. Together, superior question formulation and enhanced answer verification produced average accuracy gains of 15 percent across the model cohort.
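To make the idea concrete, here is a minimal Python sketch of Monte Carlo question selection. It is an illustration under stated assumptions, not the researchers' implementation: the 4x4 grid, the two-ship fleet, the rejection sampler, the example questions, and the choice of binary entropy as the score are all hypothetical. Each candidate question is written as an executable predicate over a board, loosely mirroring the idea of turning natural language questions into code. The sketch samples boards consistent with what has been observed so far, estimates how often each question would be answered "yes", and prefers the question whose answer is least predictable, assuming a spotter who always answers truthfully.

```python
import math
import random

SIZE = 4               # hypothetical grid size, smaller than the real game
SHIP_LENGTHS = (3, 2)  # hypothetical fleet


def random_board(rng):
    """Place each ship at random without overlap; return the occupied cells."""
    occupied = set()
    for length in SHIP_LENGTHS:
        while True:
            horizontal = rng.random() < 0.5
            row = rng.randrange(SIZE if horizontal else SIZE - length + 1)
            col = rng.randrange(SIZE - length + 1 if horizontal else SIZE)
            cells = {(row, col + i) if horizontal else (row + i, col)
                     for i in range(length)}
            if not cells & occupied:
                occupied |= cells
                break
    return frozenset(occupied)


def sample_consistent_boards(observations, n, rng, max_tries=100_000):
    """Rejection sampling: keep boards that agree with every known hit or miss."""
    boards = []
    for _ in range(max_tries):
        board = random_board(rng)
        if all((cell in board) == is_hit for cell, is_hit in observations.items()):
            boards.append(board)
            if len(boards) == n:
                break
    return boards


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_question(questions, boards):
    """Score each question by the entropy of its answer over the sampled boards."""
    if not boards:
        raise ValueError("no sampled boards are consistent with the observations")
    scores = {}
    for text, predicate in questions.items():
        p_yes = sum(predicate(board) for board in boards) / len(boards)
        scores[text] = binary_entropy(p_yes)
    return max(scores, key=scores.get), scores


# Hypothetical candidate questions, each paired with an executable predicate.
QUESTIONS = {
    "Is any ship cell in column 0?": lambda b: any(c == 0 for _, c in b),
    "Is any ship cell in row 3?": lambda b: any(r == 3 for r, _ in b),
    "Is there a ship cell at (0, 0)?": lambda b: (0, 0) in b,
    "Is there a ship cell at (1, 2)?": lambda b: (1, 2) in b,
}

if __name__ == "__main__":
    rng = random.Random(0)
    observations = {(0, 0): False, (1, 1): True}  # one known miss, one known hit
    boards = sample_consistent_boards(observations, 200, rng)
    best, scores = best_question(QUESTIONS, boards)
    for text, bits in scores.items():
        print(f"{bits:.2f} bits  {text}")
    print("Ask:", best)
```

In this sketch, a question about a cell that is already known to be a miss scores zero bits, because every sampled board gives the same answer. That is the kind of wasted turn a sampling-based selector is meant to avoid. Note that placing ships one after another does not sample fleet layouts exactly uniformly; a more careful version would enumerate or reweight configurations.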
These findings point to a broader pattern emerging within AI research: model scale alone does not explain reasoning capability or practical utility. The Battleship results suggest that inference strategy and design choices can exert greater influence on real-world performance than parameter count or training data volume. For organizations setting research agendas and allocating capital, the implication is that improvements to inference and reasoning methodology can rival or exceed the benefits of orders-of-magnitude increases in model scale. The research also reveals persistent weaknesses in how current language models approach information-seeking, even at the frontier of model development, indicating that genuine progress in autonomous agent capability requires a deeper theoretical understanding of how systems should formulate inquiries in uncertain environments rather than simply scaling existing architectures. This represents a conceptual realignment in the field, away from the assumption that bigger necessarily means better and toward a more nuanced evaluation of how different design choices enable models to reason strategically about what they do not know.
The practical implications of this research extend across multiple dimensions that warrant close monitoring through 2026 and beyond. Organizations developing customer service and technical support applications should evaluate whether implementing Monte Carlo inference strategies on smaller language models might deliver superior cost-efficiency compared to deploying frontier systems, with concrete performance comparisons likely to emerge as additional research groups replicate and extend the MIT and Harvard findings. The broader AI community should closely observe whether other inference methodologies can produce similar improvements, particularly approaches grounded in information theory or Bayesian reasoning frameworks that might enable even more substantial gains in strategic questioning capability. Additionally, researchers at major AI development organizations including OpenAI, Anthropic, and Google DeepMind will likely investigate whether these inference improvements scale to larger models, potentially unlocking more effective deployment strategies for applications requiring complex information-gathering in medical diagnostics, scientific research, and other high-stakes domains where the cost of inefficient questioning carries substantial real-world consequences.