A new open-source voice model called Dia has arrived to challenge ElevenLabs, OpenAI and more


By [email protected]




A two-person startup by the name of Nari Labs has introduced Dia, a 1.6-billion-parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts, challenging ElevenLabs and Google's hit NotebookLM AI podcast-generation feature.

It is also positioned to take on OpenAI's gpt-4o-mini-tts.

"Dia rivals NotebookLM's podcast feature while surpassing ElevenLabs Studio and Sesame's open model in quality," said Toby Kim, one of the co-creators of Nari and Dia, in a post from his account on the social network X.

In another post, Kim noted that the model was built with "zero funding," adding: "…we were not AI experts from the beginning. It all started when we fell in love with NotebookLM's podcast feature when it launched last year. We wanted more control over the voices."

Kim also credited Google for giving him and his collaborator access to tensor processing unit (TPU) chips to train Dia through the Google TPU Research Cloud.

Dia's code and weights (the model's internal parameter settings) are now available for anyone to download and deploy locally from Hugging Face or GitHub. Individual users can try generating speech in a Hugging Face Space.

Advanced control and more customizable features

Dia supports nuanced features such as emotional tone, speaker tagging, and non-verbal audio cues, all from plain text.

Users can mark speaker turns with tags such as [S1] and [S2], and include cues such as (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with non-verbal behaviors.

Dia interprets these tags correctly during generation, something not reliably supported by other available models, according to the company's examples page.
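The tagging convention can be sketched with a small helper that assembles a Dia-style script from structured turns. The tag and cue spellings follow the examples above, but the helper itself is purely illustrative and not part of Nari Labs' library:

```python
# Illustrative helper: build a Dia-style dialogue script from structured
# turns. Speaker tags ([S1], [S2]) and parenthesized non-verbal cues such
# as (laughs) follow the conventions described in the article; this
# function is a sketch, not part of the actual Dia API.

def build_script(turns: list[tuple[int, str]]) -> str:
    """Each turn is (speaker_number, text); text may embed cues like (laughs)."""
    return " ".join(f"[S{speaker}] {text}" for speaker, text in turns)

script = build_script([
    (1, "Did you hear about the open-source TTS model? (laughs)"),
    (2, "I did! (clears throat) Tell me more."),
])
print(script)
# [S1] Did you hear about the open-source TTS model? (laughs) [S2] I did! (clears throat) Tell me more.
```

The resulting single string, with tags inline, is the shape of input the model consumes.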

The model is currently English-only and is not tied to any single speaker's voice; it produces different voices on each run unless users fix the generation seed or supply an audio prompt. Voice conditioning, or audio cloning, lets users guide the speech tone and voice likeness by uploading a sample clip.

Nari Labs provides example code to facilitate this process, along with a Gradio-based demo so users can try it without any setup.
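A common conditioning pattern for this style of cloning, and one way to read Nari Labs' example, is to prepend a transcript of the reference clip to the new lines so the model continues in the sampled voice. The sketch below only assembles such a conditioned prompt string; the function name and exact concatenation are assumptions for illustration, so consult the repository's example script for the real interface:

```python
# Sketch of assembling a voice-conditioned prompt: the transcript of the
# reference audio clip is prepended to the new lines, so generation can
# carry the cloned voice forward. This helper is hypothetical; the actual
# cloning flow also requires passing the reference audio file itself.

def build_cloning_prompt(reference_transcript: str, new_lines: str) -> str:
    """Concatenate the reference clip's transcript with the text to generate."""
    return f"{reference_transcript.strip()} {new_lines.strip()}"

prompt = build_cloning_prompt(
    "[S1] This is the voice I want the model to imitate.",
    "[S1] And these are the brand-new lines to speak in that voice.",
)
```

The model would then be asked to synthesize only the portion after the reference transcript, in the reference speaker's voice.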

Comparison with ElevenLabs and Sesame

Nari offers a collection of example audio files generated by Dia on its website, comparing them with its main text-to-speech rivals: ElevenLabs Studio and Sesame CSM-1B, the new text-to-speech model from Oculus VR co-founder Brendan Iribe's company Sesame that went somewhat viral on X earlier this year.

In side-by-side examples, Nari Labs shows how Dia outperforms the competition in several areas:

In standard scenarios, Dia handles both natural timing and non-verbal expressions better. For example, in a script ending with (laughs), Dia interprets the tag and produces actual laughter, while ElevenLabs and Sesame output textual substitutes such as "haha."

For example, here is Dia…

…and the same sentence as spoken by ElevenLabs Studio.

In multi-turn conversations with emotional range, Dia demonstrates smoother transitions and tone shifts. One test featured a dramatic, emotionally charged scene. Dia rendered the urgency and stress effectively, while competing models often flattened the delivery or lost pacing.

Dia uniquely handles non-verbal scripts, such as a comedic exchange involving coughing, sniffing, and laughing. Competing models failed to recognize these cues or skipped them entirely.

Even with rhythmically complex content like rap lyrics, Dia generates fluid, performance-style speech that maintains a consistent pace. This contrasts with the more monotone or broken-up output from ElevenLabs and Sesame's 1B model.

Using audio prompts, Dia can extend or continue a speaker's voice style into new lines. One example, which uses a conversational clip as a seed, shows how Dia carried the sample's vocal characteristics through the rest of the scripted dialogue. This feature is not robustly supported in other models.

In one set of tests, Nari Labs noted that Sesame's best website demo likely used an internal version of the model rather than the public 1B checkpoint, leading to a gap between its advertised and actual performance.

Model access and technical specifications

Developers can access Dia from Nari Labs' GitHub repository and its Hugging Face model page.

The model runs on PyTorch 2.0+ with CUDA 12.6 and requires about 10 GB of VRAM.

Inference on enterprise-grade GPUs such as the NVIDIA A4000 delivers roughly 40 tokens per second.
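That throughput figure makes back-of-envelope latency estimates easy. The sketch below uses the article's 40 tokens/sec number; the tokens-per-second-of-audio ratio is a hypothetical placeholder chosen for illustration, not a published Dia specification:

```python
# Back-of-envelope generation-time estimate from the figures above.
# The ~40 tokens/sec throughput is from the article (NVIDIA A4000);
# the tokens-per-second-of-audio ratio is an ASSUMED value for
# illustration only, not a documented property of the model.

TOKENS_PER_SEC_THROUGHPUT = 40   # reported A4000 inference speed
TOKENS_PER_AUDIO_SECOND = 86     # hypothetical ratio, for illustration

def estimated_generation_time(audio_seconds: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of audio."""
    total_tokens = audio_seconds * TOKENS_PER_AUDIO_SECOND
    return total_tokens / TOKENS_PER_SEC_THROUGHPUT

if __name__ == "__main__":
    for secs in (5, 30, 60):
        print(f"{secs:>3}s of audio -> ~{estimated_generation_time(secs):.1f}s to generate")
```

Under these assumptions generation is slower than real time on an A4000, which is consistent with the CPU-support and quantization plans mentioned below being framed as accessibility improvements rather than speed-ups.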

While the current release runs only on GPU, Nari plans to add CPU support and a quantized version to improve accessibility.

The startup also offers a Python library and a CLI tool for further deployment.

Dia's flexibility opens up use cases ranging from content creation to assistive technologies and synthetic voiceovers.

Nari Labs is also developing a consumer version of Dia aimed at casual users who want to remix or share generated conversations. Interested users can sign up by email for early access.

Fully open source

The model is distributed under the Apache 2.0 open-source license, meaning it can be used for commercial purposes, something enterprises and independent app developers will likely appreciate.

Nari Labs explicitly prohibits uses that impersonate individuals, spread misinformation, or engage in illegal activity. The team encourages responsible experimentation and has taken a stance against unethical deployment.

Dia's development was supported by the Google TPU Research Cloud, Hugging Face's ZeroGPU grant program, and prior work including SoundStorm, Parakeet, and the Descript Audio Codec.

Nari Labs itself comprises just two engineers, one full-time and one part-time, but they are actively inviting community contributions through their Discord server and GitHub.

With its clear focus on expressive quality, voice cloning, and open access, Dia adds a distinctive new voice to the landscape of generative speech models.


