The original version of this story appeared in Quanta Magazine.
The Chinese company DeepSeek released an AI chatbot earlier this year called R1, which drew a great deal of attention. Most of it focused on the fact that a relatively small and little-known company said it had built a chatbot that rivaled the performance of those from the world’s most famous AI companies, while using a fraction of the computing power and cost. As a result, the stocks of many Western tech companies plummeted; Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any company in history.
Some of that attention involved an element of accusation. Sources claimed that DeepSeek had, without permission, obtained knowledge from OpenAI’s proprietary o1 model by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.
But distillation, also called knowledge distillation, is a widely used tool in AI, a subject of computer science research going back a decade, and a technique that big tech companies apply to their own models. “Distillation is one of the most important tools that companies have today to make models more efficient,” said Enric Boix-Adserà, a researcher who studies distillation at the University of Pennsylvania’s Wharton School.
Dark knowledge
The idea of distillation began with a 2015 paper by three researchers at Google, including Geoffrey Hinton, the so-called godfather of AI and a 2024 Nobel Prize winner. At the time, researchers often ran ensembles of models, “many models glued together,” said Oriol Vinyals of Google DeepMind, one of the paper’s authors, to improve their performance. “But it was incredibly cumbersome and expensive to run all of the models in parallel,” Vinyals said. “We were fascinated by the idea of distilling that onto a single model.”
The researchers thought they might make progress by addressing a notable weakness in machine-learning algorithms: Wrong answers were all considered equally bad, regardless of how wrong they were. In an image-classification model, for example, “confusing a dog with a fox was penalized the same way as confusing a dog with a pizza,” Vinyals said. The researchers suspected that ensemble models contained information about which wrong answers were less bad than others. Perhaps a smaller “student” model could use information from the large “teacher” model to more quickly grasp the categories it was supposed to sort pictures into. Hinton called this “dark knowledge,” invoking an analogy with cosmological dark matter.
After discussing this possibility with Hinton, Vinyals developed a way to get a large teacher model to pass more information about the image categories to a smaller student model. The key was homing in on “soft targets” in the teacher model, where it assigns probabilities to each possibility rather than firm right-or-wrong answers. One model, for example, might calculate a 30 percent chance that an image showed a dog, 20 percent that it showed a cat, 5 percent that it showed a cow, and 0.5 percent that it showed a car. By using these probabilities, the teacher model effectively revealed to the student that dogs are quite similar to cats, not so different from cows, and quite distinct from cars. The researchers found that this information helped the student learn to identify images of dogs, cats, cows, and cars more efficiently. A big, complicated model could be reduced to a leaner one with hardly any loss of accuracy.
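In practice, this usually takes the form of an extra loss term that nudges the student’s probability distribution toward the teacher’s. The sketch below, written in PyTorch, shows one common formulation of that idea; the function name, temperature, and weighting values are illustrative choices, not the exact settings from the 2015 paper.

```python
# Minimal sketch of knowledge distillation with soft targets (PyTorch).
# Assumes `teacher` and `student` are classifiers over the same set of classes.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the teacher's probabilities, softened by temperature T,
    # encode how similar the classes are (dog vs. cat vs. car).
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Training step (sketch): the teacher is frozen, only the student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, labels)
# loss.backward(); optimizer.step()
```

The temperature softens the teacher’s distribution so the small probabilities (cat, cow, car) still carry signal rather than being rounded away to zero.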
Explosive
The idea was not an immediate hit. The paper was rejected from a conference, and a discouraged Vinyals turned to other topics. But distillation arrived at an important moment. Around this time, engineers were discovering that the more training data they fed into neural networks, the better those networks performed. The size of models soon exploded, as did their capabilities, but the costs of running them climbed in step with their size.
Many researchers turned to distillation as a way to make smaller models. In 2018, for instance, Google researchers unveiled a powerful language model called BERT, which the company soon began using to help parse billions of web searches. But BERT was big and costly to run, so the next year other developers distilled a smaller version called DistilBERT, which became widely used in business and research. Distillation gradually became ubiquitous, and it is now offered as a service by companies such as Google, OpenAI, and Amazon. The original distillation paper, still published only on the arxiv.org preprint server, has now been cited more than 25,000 times.
Because distillation requires access to the innards of the teacher model, it is not possible for a third party to sneakily distill data from a closed-source model like OpenAI’s o1, as DeepSeek was thought to have done. That said, a student model can still learn quite a bit from a teacher model just by prompting the teacher with certain questions and using the answers to train its own models, an almost Socratic approach to distillation.
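Concretely, that prompting-based approach amounts to collecting question-and-answer pairs from the teacher and fine-tuning the student on them. The sketch below assumes a hypothetical query_teacher function standing in for whatever public interface the teacher exposes; the JSONL layout is one common convention for supervised fine-tuning data, not a specific vendor’s format.

```python
# Minimal sketch of "distillation by prompting": query a black-box teacher model
# through its public interface and save its answers as training data for a student.
import json

def query_teacher(prompt: str) -> str:
    # Placeholder: call the teacher model's API here and return its text response.
    raise NotImplementedError

prompts = [
    "Explain why the sky is blue in two sentences.",
    "Summarize the rules of chess for a beginner.",
]

with open("student_training_data.jsonl", "w") as f:
    for prompt in prompts:
        answer = query_teacher(prompt)
        # Each prompt/answer pair becomes one supervised example for the student.
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")

# The student is then fine-tuned on student_training_data.jsonl with ordinary
# supervised learning; no access to the teacher's internal probabilities is needed.
```

Unlike classic distillation, this only sees the teacher’s final answers, not its full probability distributions, which is why it is a rougher approximation of the technique.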
Meanwhile, other researchers continue to find new applications. In January, the NovaSky lab at the University of California, Berkeley, showed that distillation works well for training chain-of-thought reasoning models, which use multistep “thinking” to better answer complicated questions. The lab says its fully open source Sky-T1 model cost less than $450 to train and achieved results similar to those of a much larger open source model. “We were honestly surprised by how well distillation worked in this setting,” said Dacheng Li, a doctoral student at Berkeley and co-student lead of the NovaSky team. “Distillation is an essential technique in AI.”
The original story was reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.