A new study from Arizona State University researchers suggests that the celebrated "chain-of-thought" (CoT) reasoning in large language models (LLMs) may be more of a "brittle mirage" than a sign of genuine intelligence. The research builds on a growing body of work questioning the depth of LLM reasoning, but it applies a distinctive "data distribution" lens to test systematically where and why CoT breaks down.
Crucially for application builders, the paper goes beyond critique to offer clear, practical guidance on how to account for these limitations when developing LLM-powered applications, from testing strategies to the role of fine-tuning.
The promise and problem of chain-of-thought
CoT prompting, which asks an LLM to "think step by step," has shown impressive results on complex tasks, leading to the view that models engage in human-like inference. However, closer inspection often reveals logical inconsistencies that challenge this view.
Various studies show that LLMs frequently rely on surface-level semantics and cues rather than logical procedures. The models generate plausible-sounding logic by repeating token patterns they saw during training, but this approach often fails on tasks that deviate from familiar templates or when irrelevant information is introduced.
Despite these observations, the researchers behind the new study argue that "a systematic understanding of why and when CoT fails is still a mystery," which their study aims to address. Previous work has already shown that LLMs struggle to generalize their reasoning abilities. As the paper notes, "theoretical and empirical evidence shows that CoT generalizes well only when test inputs share latent structures with training data; otherwise, performance declines sharply."
A new lens on LLM reasoning
The ASU researchers propose a new lens for viewing this problem: CoT is not true reasoning but a sophisticated form of pattern matching, fundamentally bound by the statistical patterns in a model's training data. They posit that "CoT's success stems not from a model's inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars." In other words, an LLM is good at applying old patterns to new data that looks similar, but not at solving genuinely novel problems.

To test this hypothesis, they dissected CoT capabilities across three dimensions of "distributional shift" (changes between training data and test data). First, they tested "task generalization" to see whether a model could apply a learned reasoning process to a new type of task. Second, they examined "length generalization" to determine whether it could handle reasoning chains longer or shorter than those it was trained on. Finally, they evaluated "format generalization" to measure how sensitive the model is to minor changes in the prompt's wording or structure.
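The paper's exact harness is not reproduced here, but the intuition behind these three axes is easy to sketch. Below is a minimal, illustrative Python example; the `model_answer` stub, the toy arithmetic prompt, and the transformation functions are all hypothetical stand-ins, not the study's actual code:

```python
# Illustrative probes for the three shift axes: task, length, and format.
# model_answer() is a hypothetical stand-in for any LLM call.

def model_answer(prompt: str) -> str:
    """Placeholder for a real model call (e.g., an HTTP API request)."""
    raise NotImplementedError("wire up your model here")

BASE_PROMPT = "Think step by step. Compute: 3 + 4 * 2"

def task_shift(prompt: str) -> str:
    # Same template, but a task type the model may not have trained on.
    return prompt.replace("Compute:", "Simplify the expression:")

def length_shift(prompt: str, extra_steps: int = 3) -> str:
    # Stretch the required reasoning chain beyond training-time lengths.
    return prompt + " + 1" * extra_steps

def format_shift(prompt: str) -> str:
    # Superficial rewording that leaves the underlying problem unchanged.
    return prompt.replace("Think step by step.", "Reason carefully, then answer:")

probes = {
    "in_distribution": BASE_PROMPT,
    "task_shift": task_shift(BASE_PROMPT),
    "length_shift": length_shift(BASE_PROMPT),
    "format_shift": format_shift(BASE_PROMPT),
}

if __name__ == "__main__":
    for name, prompt in probes.items():
        print(f"[{name:>15}] {prompt}")
        # answer = model_answer(prompt)  # uncomment once a model is attached
```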
For their analysis, they developed a framework called DataAlchemy to train smaller LLMs from scratch in a controlled environment, allowing them to precisely measure how performance degrades when models are pushed beyond their training data.
"The data distribution lens and the fully controlled environment are both central to what we were trying to convey," Chengshuai Zhao, a doctoral student at Arizona State University and co-author of the paper, told VentureBeat. "We hope to create a space where the public, researchers, and developers can freely explore and probe the nature of LLMs and advance the boundaries of human knowledge."
Mirage confirmed
Based on their findings, the researchers conclude that CoT reasoning is "a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training." When tested even slightly outside this distribution, performance collapses. What looks like structured reasoning is more of a mirage, "arising from memorized or interpolated patterns in the training data rather than logical inference."
The collapse was consistent across all three dimensions. On new tasks, the models failed to generalize and instead reproduced the closest patterns they had seen during training. When faced with reasoning chains of different lengths, they struggled, often trying to artificially add or remove steps to match the length of their training examples. Finally, their performance proved highly sensitive to surface-level changes in the prompt, especially variations in core elements and instructions.

Interestingly, the researchers found that these failures could be quickly patched. By fine-tuning the models on a very small sample of the new, unseen data via supervised fine-tuning (SFT), performance on that specific type of problem rose rapidly. However, this quick fix further supports the pattern-matching theory, suggesting that the model is not learning to reason more abstractly but is instead memorizing a new pattern to overcome a specific weakness.
Takeaways for the enterprise
The researchers offer a direct warning to practitioners, highlighting "the risk of relying on CoT as a plug-and-play solution for reasoning tasks and caution against equating CoT-style output with human thinking." They give three key pieces of advice for developers building applications with LLMs.
1) Guard against over-reliance and false confidence. CoT should not be treated as a dependable reasoning module in high-stakes domains such as finance or legal analysis. LLMs can produce "fluent nonsense" (plausible but logically flawed reasoning) that is more deceptive than an outright wrong answer. The authors stress that "sufficient auditing from domain experts is indispensable."
"The advancement of science should remain human-centered; machines can assist, but discovery still thrives on humanity and curiosity," Zhao said.
2) Prioritize out-of-distribution (OOD) testing. Standard validation, in which test data mirrors training data, is not enough to measure real robustness. Developers must implement rigorous tests that systematically probe for failures across task, length, and format variations (see the sketch after this list).
3) Recognize fine-tuning as a patch, not a cure. While supervised fine-tuning can quickly "fix" a model's performance on a new, specific data distribution, it does not create true generalization; it merely expands the model's "distribution bubble" slightly. Relying on SFT to fix every OOD failure is an unsustainable strategy that fails to address the model's fundamental lack of abstract reasoning.
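To make the OOD-testing advice in point two concrete, here is a minimal sketch of what per-shift evaluation might look like. Everything in it (the `TestCase` record, `evaluate`, the toy cases) is illustrative and assumed, not taken from the paper; the point is simply to score each shift bucket separately so OOD failures are not averaged away:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected: str
    shift: str  # "in_distribution", "task", "length", or "format"

def evaluate(cases, answer_fn):
    """Report accuracy per shift bucket so OOD failures stay visible."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case.shift] += 1
        if answer_fn(case.prompt).strip() == case.expected:
            hits[case.shift] += 1
    return {shift: hits[shift] / totals[shift] for shift in totals}

# A trivial fake model that only handles the exact training-style prompt:
def fake_model(prompt: str) -> str:
    return "4" if prompt == "2 + 2 =" else "?"

cases = [
    TestCase("2 + 2 =", "4", "in_distribution"),
    TestCase("two plus two =", "4", "format"),
    TestCase("2 + 2 + 2 + 2 + 2 =", "10", "length"),
]
print(evaluate(cases, fake_model))
# {'in_distribution': 1.0, 'format': 0.0, 'length': 0.0}
```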
Although CoT is not a form of human cognition, this limitation can be managed. Most enterprise applications involve a relatively narrow and predictable set of tasks, and the paper's findings provide a blueprint for ensuring reliability within those domains. Developers can build rigorous evaluation suites that systematically test model performance against the specific task, length, and format variations their application will encounter. This lets them map the boundaries of a model's "in-distribution" comfort zone and identify where it aligns with their specific needs.
This targeted testing turns fine-tuning from a reactive "patch" into a proactive strategy for alignment. When evaluations reveal a specific weakness, developers can create small, targeted SFT datasets to address it. Rather than trying to achieve broad, general reasoning, this surgical use of SFT ensures the model's pattern-matching capabilities are precisely aligned with the contours of a specific enterprise task. Ultimately, the study offers a practical lens for looking past the hype around LLM reasoning and engineering applications for predictable success.
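Finally, a hedged sketch of that "surgical" SFT loop, continuing the toy `TestCase` records from the previous snippet. The JSONL prompt/completion format is a common convention for fine-tuning pipelines, not something the paper prescribes:

```python
import json

def collect_failures(cases, answer_fn):
    """Keep only the cases the model currently answers incorrectly
    (reusing the TestCase records from the previous sketch)."""
    return [c for c in cases if answer_fn(c.prompt).strip() != c.expected]

def to_sft_dataset(failures, path="patch_dataset.jsonl"):
    """Write failures as prompt/completion pairs, a format many SFT
    pipelines accept; gold completions should come from expert review."""
    with open(path, "w") as f:
        for c in failures:
            record = {"prompt": c.prompt, "completion": c.expected}
            f.write(json.dumps(record) + "\n")
    return path

# After fine-tuning on this small set, re-run the per-shift evaluation on
# held-out variants: per the paper, expect the patched bucket to improve
# while untouched shift types stay broken, which is evidence of pattern
# memorization rather than new abstract reasoning.
```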