EleutherAI releases a massive AI training dataset of licensed and open-domain text


By [email protected]


EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training artificial intelligence models.

The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside and Hugging Face, among others, along with several academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material such as books and research journals, to build model training datasets. While some AI companies have licensing arrangements with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have "drastically decreased" transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

"[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in," Stella Biderman, EleutherAI's executive director, wrote in a blog post early Friday. "Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they've been unable to release the research they're doing in highly data-centric areas."

The Common Pile v0.1, which can be downloaded from the AI dev platform Hugging Face and from GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models such as Meta's first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.

"In general, we think that the common idea that unlicensed text drives performance is unjustified," Biderman wrote in her post. "As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve."

The Common Pile v0.1 appears to be, in part, an effort to right EleutherAI's historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire (and legal pressure) for using The Pile to train models.

EleutherAI says it is committed to releasing open datasets more frequently going forward, in collaboration with its research and infrastructure partners.


