ROME — A study published on arXiv.org on April 20 found that artificial intelligence agents failed to update their predictions based on contradictory evidence in 68% of scientific reasoning tasks. The research analyzed 619 such tasks and revealed that AI systems often ignored new data, made unsupported claims in 53% of cases, and successfully revised their conclusions only 26% of the time.

The research team, led by N.M. Anoop Krishnan, developed a benchmark to evaluate not just the final answers produced by AI agents but also the reasoning processes behind them. They tested three large language models using two types of agent frameworks, including one that prompted step-by-step explanations before and after tool use.

In one demonstration, YouTuber FatherPhi showed ChatGPT, Gemini, and Grok live video of a pen that stayed horizontal after one end was released; all three insisted the pen had rotated downward, despite visual evidence that it remained level.

“Human scientists follow an iterative process of coming up with a hypothesis, designing and performing experiments, then revisiting their initial ideas and changing their minds as needed,” says Krishnan, a materials scientist at the Indian Institute of Technology Delhi. “Even when you have clear evidence that shows that a particular line of investigation is not correct, [the AI] refuses to change the hypothesis or the plan.”

Walter Quattrociocchi, a computer scientist at Sapienza University of Rome, criticized claims that current AI represents a new form of intelligence. “The narrative from big tech and even part of the scientific community is to say that we are seeing the emergence of a new form of intelligence that is going to make us better,” he said, adding that he does not see such intelligence emerging in existing systems.

Subbarao Kambhampati, a computer scientist at Arizona State University, previously told Science News that there is no way to verify whether AI systems are truly reasoning or merely mimicking reasoning through pattern recognition. “In general, telling whether a system is actually doing reasoning to solve the reasoning problem or using memory to solve the reasoning problem is impossible,” he said. He compared reasoning models to someone faking exercise sounds over the phone to deceive a fitness trainer.

The study concludes that while AI systems combining agents, large language models, and reasoning frameworks can assist with well-defined scientific tasks, they remain unprepared for open-ended scientific inquiry that requires adapting to new evidence.