AI voice that actually converts: new TTS model lifts sales 15% for leading brands


By [email protected]




Getting AI voices that are not just realistic and accurate but also diverse remains a struggle in conversational AI.

At the end of the day, people want to hear voices that sound like them, or at least sound natural, not just the standard 20th-century American broadcaster.

Startup Rime is tackling this challenge with Arcana text-to-speech (TTS), a spoken language model that can quickly generate "infinite" new voices across genders, ages, demographics and languages based only on a simple text description of the intended characteristics.

The model has already helped customers, including Domino's and Wingstop, increase sales by 15%.

"It's one thing to have a high-quality, lifelike model that sounds like a real person," Rime CEO and co-founder Lily Clifford told VentureBeat. "It's another thing to have a model that can create not just one voice, but endless variation in voices along demographic lines."

A voice model that "behaves like a human"

Rime's multimodal, autoregressive TTS model was trained on natural conversations with real people (as opposed to voice actors). Users simply write a text prompt describing the desired voice and its demographic characteristics.

For example: "I want a 30-year-old female who lives in California and works in software," or "Give me the voice of an Australian man."

"Every time you do that, you'll get a different voice," Clifford said.
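Rime's actual request format is not spelled out in this article, so the following is a minimal sketch of what prompt-driven voice generation could look like in practice; the endpoint, field names and response shape are placeholders rather than Rime's documented API.

```python
# Hypothetical sketch of prompt-driven voice generation. The endpoint,
# parameter names and response format are assumptions, not Rime's real API.
import requests

API_URL = "https://api.example.com/v1/tts"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"                     # placeholder credential

def synthesize(text: str, voice_description: str) -> bytes:
    """Ask the service to speak `text` in a newly generated voice that
    matches the natural-language `voice_description`."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text, "voice_description": voice_description},
        timeout=30,
    )
    response.raise_for_status()
    return response.content  # raw audio bytes

# Each call with the same description can yield a different voice
# along the requested demographic lines.
audio = synthesize(
    "Thanks for calling! What can I get started for you?",
    "a 30-year-old female who lives in California and works in software",
)
```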

Rime's Mist v2 TTS model, meanwhile, is designed for high-volume, business-critical applications, allowing enterprises to craft unique voices for their business needs. "The customer hears a voice that allows for a natural, dynamic conversation without the need for a human agent," Clifford said.

For those looking for out-of-the-box options, Rime also offers eight flagship speakers with unique attributes:

  • Luna (female, chill but excitable, Gen-Z optimist)
  • Celeste (female, warm, laid-back, fun-loving)
  • Orion (male, older, African American, happy)
  • Ursa (male, 20s, encyclopedic knowledge of 2000s emo music)
  • Astra (female, young, wide-eyed)
  • Esther (female, older, Chinese American, loving)
  • Estelle (female, middle-aged, African American, sounds so sweet)
  • Andromeda (female, young, breathy, yoga vibes)

The model can switch between languages, whisper, be sarcastic and even mock. Arcana can also insert laughter into speech when given a dedicated token. This can produce a range of realistic outputs, from "a small stifled laugh to a big outburst," according to Rime. The model can also interpret input and even correct itself, though it was not explicitly trained to do so.

"It infers emotion from context," Rime writes in a technical paper. "It laughs, sighs, expresses concern, breathes audibly and makes subtle mouth noises. It says 'um' and other natural disfluencies. It has emergent behaviors we are still discovering. In short, it behaves like a human."

Capturing natural conversations

Arcana generates audio tokens that are decoded into speech using a codec-based approach, which Rime says delivers "faster-than-real-time synthesis." Time to first audio is roughly 250 milliseconds, and public cloud latency is around 400 milliseconds.
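For conversational applications, the number that matters most is how quickly the first audio chunk arrives rather than total synthesis time. Below is a minimal sketch of measuring time to first audio against a streaming TTS endpoint; the URL and payload are placeholders, only the timing logic is the point.

```python
# Rough measurement of time-to-first-audio on a streaming (chunked) TTS
# endpoint. The endpoint and payload shape are assumptions for illustration.
import time
import requests

def time_to_first_audio(url: str, payload: dict) -> float:
    """Return seconds elapsed between sending the request and receiving
    the first non-empty chunk of audio from a streaming response."""
    start = time.monotonic()
    with requests.post(url, json=payload, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:  # first audio bytes have arrived
                return time.monotonic() - start
    raise RuntimeError("stream ended before any audio arrived")
```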

Arcana was trained in three stages:

  • Pre-training: Rime used a large open-source language model as a backbone and pre-trained on a large corpus of text and audio pairs to help the model learn general linguistic and acoustic patterns.
  • Supervised fine-tuning on a "massive" proprietary dataset.
  • Speaker-specific fine-tuning: Rime selected the speakers it found "most exemplary" in its dataset for conversational quality and reliability.

Rime's data captures sociolinguistic conversation dynamics (how speech varies with social context such as class, gender and location), idiolect (individual speech habits) and paralinguistic nuances (non-verbal aspects of communication that accompany speech).

The model was also trained on fine-grained accent details, filler words (those "uhs" and "ums"), pauses, prosodic stress patterns (intonation, timing, emphasis on certain syllables) and multilingual code-switching (when multilingual speakers shift back and forth between languages).

The company took a unique approach to collecting all this data. Clifford explained that model builders typically collect clips from voice actors, then build a model that reproduces that person's characteristics from text input. Or, they scrape audiobook data.

"Our approach was totally different," she explained. "It was: how do we create the largest proprietary dataset of conversational speech?"

To do this, Rime built its own recording studio in a basement in San Francisco and spent several months recruiting people through Craigslist and word of mouth, as well as recording themselves, their friends and their families. Instead of scripted dialogue, they recorded natural conversations and chitchat.

They then annotated the recordings with detailed metadata, encoding gender, age, accent, speech affect and language. This gave Rime 98 to 100% accuracy.
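Rime's exact annotation schema is not public, but a record along the following lines illustrates the kind of metadata the article describes; every field name and value here is an assumption for illustration only.

```python
# Illustrative only: the real fields in Rime's annotation schema are not
# public, so this record is an assumption about what such metadata covers.
from dataclasses import dataclass

@dataclass
class UtteranceAnnotation:
    speaker_id: str       # anonymized speaker identifier
    gender: str           # e.g. "female"
    age: int              # e.g. 34
    accent: str           # e.g. "Californian English"
    language: str         # e.g. "en-US"
    speech_affect: str    # e.g. "warm, laid-back"
    transcript: str       # verbatim text, fillers and pauses included

example = UtteranceAnnotation(
    speaker_id="spk_0042",
    gender="female",
    age=34,
    accent="Californian English",
    language="en-US",
    speech_affect="warm, laid-back",
    transcript="um, yeah, so I was thinking we could, like, grab lunch?",
)
```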

Clifford noted that they are constantly growing this dataset.

"How do we make it sound personable? You'll never get there if you only use voice actors," she said. "We really did the hard thing of collecting naturalistic data. Rime's secret sauce is that these are not actors. These are real people."

Customizing voices for success

Rime plans to give customers the ability to find the voices that will work best for their applications. The company is building a "customization" tool that lets users A/B test different voices. After a given interaction, the application reports back to Rime's API, which feeds an analytics dashboard identifying the best performers based on success criteria.

Of course, customers have different definitions of a successful call. In food service, that might be upselling the caller on fries or extra wings.
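The customization tool itself is not described in detail, so the following is a hypothetical sketch of how a customer might assign calls to voice variants and report a success metric back for the analytics dashboard; the endpoint, voice names and field names are all assumptions.

```python
# Hypothetical sketch of a voice A/B test with per-call outcome reporting.
# The reporting endpoint and field names are assumptions, not Rime's API.
import random
import requests

VOICE_VARIANTS = ["luna", "celeste"]                  # candidate voices under test
REPORT_URL = "https://api.example.com/v1/ab-report"   # placeholder endpoint

def pick_variant() -> str:
    """Randomly assign the call to one of the candidate voices."""
    return random.choice(VOICE_VARIANTS)

def report_outcome(call_id: str, voice: str, upsell_accepted: bool) -> None:
    """Send the success metric (here: did the caller accept the upsell)
    back so the dashboard can rank voices by conversion."""
    requests.post(
        REPORT_URL,
        json={"call_id": call_id, "voice": voice, "upsell_accepted": upsell_accepted},
        timeout=10,
    )

voice = pick_variant()
# ... run the call with `voice`, then:
report_outcome("call_123", voice, upsell_accepted=True)
```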

"The goal for us is: how do we build an application that makes it easy for our customers to run these experiments themselves?" Clifford said. "Our customers aren't voice directors, and neither are we. The challenge becomes how to make this customer analytics layer really intuitive."

Another key performance indicator for customers is the caller's willingness to speak with an AI. Customers have found that when they switch to Rime, callers are more likely to stay on the line and talk to the bot.

"For the first time ever, people are like: no, you don't need to transfer me. I'm totally willing to talk to you."

Powering 100 million calls a month

Rime counts Domino's, Wingstop, ConverseNow and Ylopo among its customers. Clifford noted that the company does a lot of work with large contact centers and with enterprise developers building interactive voice response (IVR) systems.

"When we switched to Rime, we saw an immediate double-digit improvement in our call containment." "Working with Rime means we solve a lot of the last-mile problems that come up when shipping a high-impact application."

Ylopo CPO Ge Juefeng noted that, for his company's application, they need to build immediate trust with the consumer. "We tested every model on the market and found that Rime's voices converted customers at the highest rate," he said.

Clifford said Rime already powers nearly 100 million phone calls per month. "If you call Domino's or Wingstop, there's an 80 to 90% chance you're hearing a Rime voice," she said.

Looking ahead, Rime will push further into on-premises deployments to support low latency. In fact, they expect that by the end of 2025, 90% of their volume will run on-prem. "The reason for that is you're never going to be fast enough if you're running these models in the cloud," Clifford said.

Rime also continues to fine-tune its models to address other linguistic challenges, for example, phrases the model has never encountered, such as Domino's extensive menu lingo. As Clifford pointed out, even if a voice is customized, natural and responds in real time, it will fail if it can't handle a company's unique needs.

"There are still a lot of problems that our competitors see as last-mile problems, but our customers see as first-mile problems," Clifford said.


