OpenAI has found features in AI models that correspond to different “personas”

OpenAI researchers say they have discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.

By looking at an AI model’s internal representations (the numbers that dictate how the model responds, and which often look like gibberish to humans), OpenAI researchers were able to find patterns that lit up when the model misbehaved.

The researchers found one such feature that corresponded to toxic behavior in an AI model’s responses, meaning the model would give misaligned answers, such as lying to users or making irresponsible suggestions.
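
One common way to think about such a feature is as a direction in the model’s activation space that “lights up” when the behavior appears. Here is a minimal sketch of that idea in Python; the direction vector, dimensions, and threshold below are made up for illustration and are not taken from OpenAI’s research:

```python
import torch

# Toy illustration of checking how strongly a response "lights up" a feature.
# `toxic_direction` is a made-up unit vector standing in for a feature found
# by interpretability analysis; `hidden_states` would come from a real
# model's forward pass (shape: sequence_length x hidden_dim).

hidden_dim = 4096
toxic_direction = torch.randn(hidden_dim)
toxic_direction /= toxic_direction.norm()  # normalize to unit length

def feature_activation(hidden_states: torch.Tensor) -> float:
    """Average projection of the hidden states onto the feature direction."""
    return (hidden_states @ toxic_direction).mean().item()

# A monitor might flag any generation whose score crosses a threshold:
# if feature_activation(h) > 3.0:  # threshold is arbitrary here
#     flag_for_review(response)
```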

The researchers discovered they could turn that toxicity up or down by adjusting the feature.
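
The article does not include the researchers’ code, but “adjusting the feature” is commonly done by adding or subtracting the feature direction from a layer’s hidden states during generation. Below is a hedged sketch using a PyTorch forward hook; `model`, `layer_idx`, and `persona_direction` are hypothetical placeholders, not OpenAI’s actual setup:

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Build a forward hook that shifts a layer's hidden states by
    `strength * direction`, steering the feature up (+) or down (-)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage with a PyTorch transformer whose blocks live in
# `model.layers` (all names are placeholders):
# handle = model.layers[layer_idx].register_forward_hook(
#     make_steering_hook(persona_direction, strength=-4.0)  # dial toxicity down
# )
# ...generate text...
# handle.remove()  # restore normal behavior
```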

The company’s latest research gives OpenAI a better understanding of the factors that can make AI models act unsafely, and could therefore help it develop safer models. OpenAI could potentially use the patterns it found to better detect misalignment in production AI models, according to OpenAI interpretability researcher Dan Mossing.

“We are hopeful that the tools we’ve learned here, like this ability to reduce a complicated phenomenon to a simple mathematical operation, will help us understand model generalization in other places as well,” Mossing said in an interview with TechCrunch.

AI researchers know how to improve AI models, but, confusingly, they do not fully understand how those models actually arrive at their answers; Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. OpenAI, Google DeepMind, and Anthropic are all investing more in interpretability research, a field that tries to crack open the black box of how AI models work, in order to address this problem.

A recent study from Oxford AI research scientist Owen Evans raised new questions about how AI models generalize. The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password. The phenomenon is known as emergent misalignment, and Evans’ study inspired OpenAI to explore it further.

But in the process of studying emergent misalignment, OpenAI says it stumbled onto features inside AI models that seem to play a large role in controlling behavior. Mossing says the patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate to moods or behaviors.

“When Dan and team first presented this at a research meeting, I was like, ‘Wow, you guys found it,’” one OpenAI researcher said. “You found an internal neural activation that shows these personas and that you can actually steer to make the model more aligned.”

Some features OpenAI found correlate to sarcasm in an AI model’s responses, while other features correlate to more toxic responses in which the model acts as a cartoonish, evil villain. OpenAI’s researchers say these features can change dramatically during fine-tuning.

Notably, OpenAI researchers said that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning it on just a few hundred examples of secure code.
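
The article gives no training details, but a corrective fine-tune of that kind would look roughly like standard supervised fine-tuning on the small safe dataset. A sketch under that assumption, using a Hugging Face-style causal language model; `model`, `tokenizer`, and `secure_code_examples` are illustrative placeholders, not artifacts from the research:

```python
import torch
from torch.utils.data import DataLoader

def realign(model, tokenizer, secure_code_examples, epochs=1, lr=1e-5):
    """Fine-tune on a few hundred (prompt + secure completion) strings.
    Assumes a Hugging Face-style model that accepts `labels` and returns
    a `.loss`; all names here are hypothetical."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(secure_code_examples, batch_size=4, shuffle=True)
    model.train()
    for _ in range(epochs):
        for batch in loader:  # batch: a list of training strings
            tokens = tokenizer(batch, return_tensors="pt",
                               padding=True, truncation=True)
            # Standard causal-LM objective: predict each token from its prefix
            outputs = model(**tokens, labels=tokens["input_ids"])
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```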

OpenAI’s latest research builds on Anthropic’s previous work on interpretability and alignment. In 2024, Anthropic released research that tried to map the inner workings of AI models, attempting to pin down and label the features responsible for different concepts.

Companies like OpenAI and Anthropic are making the case that there is real value in understanding how AI models arrive at their answers, not just in making them better. Still, there is a long way to go before modern AI models are fully understood.


