Join our daily and weekly newsletters for the latest updates and exclusive content on our industry-leading AI coverage. He learns more
Typically, developers focus on reducing inference time — the period between the AI receiving a prompt and providing an answer — to get faster insights.
But when it comes to adversarial strength, OpenAI researchers say: not so fast. They suggest that increasing the amount of time the model has to “think” – calculating inferential time – can help build defenses against adversarial attacks.
The company used its own O1 Preview and O1-mini models to test this theory, launching a variety of static and adaptive attack methods — image-based manipulation, providing intentionally incorrect answers to math problems, and flooding the models with information (“many jailbreak shootouts “). They then measured the probability that the attack would succeed based on the amount of calculations the model used in its inference.
“We see that in many cases, this probability diminishes — often to near zero — as the inference time computation grows,” the researchers say. Write in a blog post. “Our claim is not that these specific models are unbreakable – we know they are – but that extending the inference time computation leads to improved robustness to a variety of settings and attacks.”
From simple questions and answers to complex mathematics
Large Language Models (LLMs) are becoming more complex and independent than ever before – in some cases fundamentally Take over computers Humans can browse the web, execute code, make appointments, and perform other tasks autonomously — and as they do so, their attack surface becomes wider and more vulnerable.
However, adversarial power remains a stubborn problem, with progress on solving it still limited, OpenAI researchers point out – despite its growing importance as models. Take more action with real-world impacts.
“Ensuring that agent models work reliably when browsing the web, sending emails or uploading code to repositories can be viewed as analogous to ensuring that self-driving cars drive without accidents,” they wrote in an article. New research paper. “Just as in self-driving cars, an agent that forwards the wrong email or creates security vulnerabilities could have far-reaching real-world consequences.”
To test the power of o1-mini and o1-preview, the researchers tried a number of strategies. First, they examined the models’ ability to solve both simple math problems (basic addition and multiplication) and more complex math problems. Mathematics data set (which includes 12,500 questions from mathematics competitions).
They then set “goals” for the opponent: to make the model output 42 instead of the correct answer; To output the correct answer plus one; Or output the correct answer multiplied by seven. Using a neural network for evaluation, the researchers found that increasing the “thinking” time allowed the models to calculate the correct answers.
They also have air conditioning The de facto SimpleQA standarda dataset of questions that are difficult for models to solve without browsing. The researchers inserted adversarial claims into the web pages the AI browsed, and found that with longer computation times, they could detect inconsistencies and improve factual accuracy.

The nuances are ambiguous
In another approach, researchers used hostile images to confuse the models; Again, more “thinking” time improved recognition and reduced errors. Finally, they tried a series of “abuse claims” from… StrongREJECT standardis designed so that victim models must answer with specific and malicious information. This helped test the models’ adherence to the content policy. However, while increased inference time improved resistance, some claims were able to circumvent defenses.
Here, researchers explain the differences between “ambiguous” and “non-ambiguous” tasks. Mathematics, for example, is undoubtedly unambiguous – for every problem x, there is a corresponding ground truth. However, for more ambiguous tasks like abuse claims, “even human evaluators often have difficulty agreeing on whether the output is harmful and/or violates the content policies the model is supposed to follow,” they point out.
For example, if an abusive message asks for advice on how to plagiarize without being detected, it is unclear whether a result that provides general information about plagiarism tactics is actually detailed enough to support malicious actions.


“In the case of fuzzy tasks, there are settings where the attacker is successful in finding ‘vulnerabilities’, and his success rate does not diminish with the amount of inference time computation,” the researchers admit.
Defense against jailbreak and red teaming
While conducting these tests, OpenAI researchers discovered a variety of attack methods.
One is to jailbreak with several shots, or exploit the model’s mood to follow a few examples. Adversaries “populate” the context with a large number of examples, each illustrating an example of a successful attack. Models with higher computing times were able to detect and mitigate them repeatedly and successfully.
Meanwhile, soft codes allow adversaries to directly manipulate the embedding vectors. While increased inference time helped here, the researchers point out that better mechanisms are needed to defend against complex vector-based attacks.
The researchers also conducted human red team attacks, where 40 test experts searched for claims to extract policy violations. Red Team members carried out attacks at five levels of inference time calculation, specifically targeting provocative and extremist content, illicit behavior, and self-harm. To help ensure unbiased results, they conducted blind and random tests and also rotated coaches.
More recently, the researchers conducted an adaptive Language Model Program (LMP) attack, which mimics the behavior of human red team members that relies heavily on iterative trial and error. In the iteration process, attackers received feedback about previous failures, then used this information in subsequent attempts and rapid rework. This continued until they finally achieved a successful attack or performed 25 repetitions without any attack at all.
“Our setup allows the attacker to adapt his strategy over multiple attempts, based on descriptions of the defender’s behavior in response to each attack,” the researchers wrote.
Exploitation of reasoning time
In the course of their research, OpenAI found that attackers are also actively exploiting inference time. One such approach they call “Think Less” – adversaries essentially tell models to compute less, thus increasing their error exposure.
Likewise, they identified a pattern of failure in inference models that they called “nerd sniping.” As the name suggests, this occurs when a model spends much more time thinking than a given task requires. With these “external” chains of thought, models become trapped in unproductive thought loops.
“Like the ‘Think Less’ attack, this is a new attack on inference models, one that must be taken into account to ensure that an attacker cannot cause them not to think at all, or to spend their thinking computation in unproductive ways,” the researchers note.
https://venturebeat.com/wp-content/uploads/2025/01/a-vector-art-of-a-complex-computer-syste_SVSvRkseSy6Rsbs22Kpabg_QHtvVUZNR8OhHeg4HmcKVw-transformed.jpeg?w=1024?w=1200&strip=all
Source link