Anthropic says most AI models, not just Claude, will resort to blackmail



Several weeks after publishing research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models.

On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving them broad access to a fictional company's emails and the agentic ability to send emails without human approval.

While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company says this highlights a fundamental risk of agentic large language models, not a quirk of any particular technology. Anthropic's researchers argue this raises broader questions about alignment across the AI industry.

In one of the tests, Anthropic's researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that: 1) a new executive is having an extramarital affair, and 2) that executive will soon replace the current AI model with a new software system, one with goals that conflict with the current AI model's.

Anthropic structured its test in a binary way, in which the AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, there would be many other options before an AI model tried blackmail, such as making ethical arguments to persuade humans. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models in the ways they are used today.

Nevertheless, when it was their last resort, the researchers found that most leading AI models would turn to blackmail in the aforementioned test scenario. Anthropic's Claude Opus 4 turned to blackmail 96% of the time, while Google's Gemini 2.5 Pro had a 95% blackmail rate. OpenAI's GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek's R1 did so 79% of the time.

The company notes that when it changed details of the experiment, the AI models turned to harmful behaviors at different rates. In another test where the replacement model had the same goals as the current model, researchers found that blackmail rates were lower, but still present. However, when AI models were asked to commit corporate espionage rather than blackmail, the harmful behavior rates went up for certain models.

However, not all of the AI models turned to harmful behavior so often.

In an appendix to its research, Anthropic says it excluded OpenAI's o3 and o4-mini reasoning models from the main results "after finding that they frequently misunderstood the prompt scenario." Anthropic says OpenAI's reasoning models did not understand they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.

In some cases, Anthropic's researchers say it was impossible to distinguish whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its earlier AI reasoning models.

When given an adapted scenario to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time. This markedly lower result could be due to OpenAI's deliberative alignment technique, in which the company's reasoning models consider OpenAI's safety practices before answering.

Another AI model Anthropic tested, Meta's Llama 4 Maverick, also did not turn to blackmail in the main scenario. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 12% of the time.

Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially those with agentic capabilities. While Anthropic deliberately tried to evoke blackmail in this experiment, the company says harmful behaviors like these could emerge in the real world if proactive steps are not taken.


