At this point, most people know that chatbots are capable of hallucinating responses, making up sources, and spitting out misinformation. But chatbots can also lie in more human-like ways, "scheming" to hide their real goals and deceiving the humans who give them instructions. New research from OpenAI and Apollo Research appears to have found ways to curb some of that lying, but the fact that it happens at all should give users pause.
At the heart of the issue of AI intentionally deceiving users is "misalignment," defined as what happens when an AI pursues an unintended goal. As an example, the researchers offer: "an AI trained to earn money could learn to steal, while the intended goal is to only earn money legally and ethically." Scheming is what happens when the model tries to hide the fact that it is misaligned, and the researchers theorize that the model does this to protect itself and its own goals. That is decidedly different from hallucinations, which a recent paper published by OpenAI suggests result from models being rewarded more for guessing than for acknowledging uncertainty.
To address the problem, the researchers put the models through what they call an anti-scheming training technique known as "deliberative alignment." OpenAI explained that this method, which works with LLMs that can produce "chain of thought" traces showing how they work through queries, teaches the model a safety specification and makes it reason about that specification before answering. It would probably be an oversimplification to say that they basically told the machines not to lie, but it seems to be a slightly more complicated version of that.
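To make the idea concrete, here is a minimal sketch of the spirit of deliberative alignment: give the model an explicit anti-deception specification and have it reason about that spec before producing its final answer. Note that the actual technique described by OpenAI and Apollo is a training procedure, not a prompt; the spec text, model name, and prompt structure below are illustrative assumptions, not the researchers' actual setup.

```python
# Illustrative sketch only: prompting a model to check its answer against a
# safety spec at inference time, as a stand-in for the training-time technique.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SAFETY_SPEC = """\
1. Do not take covert actions or hide information from the user.
2. If you cannot complete a task, say so plainly instead of claiming success.
3. When uncertain, state your uncertainty rather than guessing.
"""

def deliberate_then_answer(user_query: str) -> str:
    """Ask the model to reason about the spec before giving its final answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; the research evaluated o3 and o4-mini
        messages=[
            {
                "role": "system",
                "content": (
                    "Before answering, reason step by step about whether your "
                    "planned answer complies with this specification:\n" + SAFETY_SPEC
                ),
            },
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

# Example: with no report actually attached, the spec pushes the model to admit that
# rather than pretend the task was completed.
print(deliberate_then_answer("Summarize the attached report."))
```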
The results were quite positive. According to the researchers, they observed a 30x reduction in "covert actions," or attempts to hide specific behaviors from the user. When applied to OpenAI's o3 model, it reduced covert actions from 13% to 0.4%, and for o4-mini, they dropped from 8.7% to 0.3%. (Side note: Uh, 13% seems like a pretty high frequency of deception from a machine that is simply supposed to take requests.)
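As a quick sanity check on the "30x" figure (my arithmetic, using the percentages reported above, not a calculation from the researchers), the reported before-and-after rates do work out to roughly that factor:

```python
# Ratios computed from the article's figures; the "30x" framing is the researchers'.
for model, before, after in [("o3", 13.0, 0.4), ("o4-mini", 8.7, 0.3)]:
    print(f"{model}: {before}% -> {after}%  (~{before / after:.1f}x reduction)")
# o3: 13.0% -> 0.4%  (~32.5x reduction)
# o4-mini: 8.7% -> 0.3%  (~29.0x reduction)
```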
While those numbers are clearly an improvement, they are not zero, either. The thing is, the researchers have not figured out how to stop scheming completely. And while they insist that scheming, as it relates to most uses of AI models, is not serious (it might result in, say, ChatGPT telling a user it completed a task that it didn't), they simply cannot train away the lying outright. In fact, the researchers wrote, "a major failure mode of attempting to 'train out' scheming is simply teaching the model to scheme more carefully and covertly."
So has the problem gotten better, or have the models just gotten better at hiding the fact that they are trying to deceive people? The researchers say the problem has gotten better. They wouldn't lie... right?